INTRODUCTION

This  study  investigated  modular  and  ensemble  systems  of  machine  learning 
methods  for  computer-aided  diagnosis  (CAD)  of  breast  cancer  to  reduce  the  number  of 
benign  biopsies.  While  mammography  is  valuable  for  early  detection  of  breast  cancer,  it 
has  a  high  false-positive  rate.  A  CAD  system  for  identifying  very  likely  benign  lesions  as 
candidates  for  follow-up  instead  of  biopsy  could  spare  women  discomfort,  anxiety,  and 
expense  and  potentially  improve  the  cost-effectiveness  of  mammographic  screening 
programs. 

This predoctoral fellowship covers two different students, both mentored by
Joseph Lo. It was originally awarded to Mia Markey, who graduated from Duke
University in 2002 with a Ph.D.; her dissertation was titled "Modular Machine Learning
Methods for Computer-Aided Diagnosis of Breast Cancer." The original aims were concluded
as part of that dissertation research. As noted in last year’s report, the Army authorized
the transfer of the remaining fellowship to Jonathan Jesneck. We proposed new Aims 4 and 5
based on the successes as well as the difficulties discovered previously. Consistent with
those aims, Mr. Jesneck has developed ensemble classifiers for the task of computer-aided
diagnosis of breast microcalcification clusters, which are very challenging to characterize
for radiologists and computer models alike. The rationale and progress for these aims are
summarized in the report below.

BODY 

The  data  consisted  of  mammographic  features  extracted  by  automated  image 
processing  algorithms.  The  same  cases  were  used  as  described  in  last  year’s  report. 

Task  1.  Identify  subsets  of  the  training  data  using  both  a  priori  information 
and  unsupervised  learning  methods. 

The  database  of  digitized  mammograms  has  already  been  created  and  analyzed,  as 
described  in  a  previous  year’s  report.  The  most  important  grouping  was  mass  vs. 
calcification  lesions.  In  particular,  both  radiologists  and  computer  models  performed  far 
worse  when  attempting  to  characterize  the  calcification  lesions,  which  motivated  the 
current  emphasis  on  these  types  of  lesions  (see  #1,  #2,  and  #3  in  Reportable  Outcomes). 
This  aim  is  now  concluded. 

Task  2.  Build  local  models  for  breast  cancer  prediction  for  each  subset  of  the 
training  data  using  supervised  learning  methods.  Evaluate  the  performance  of  the 
local  models  on  the  training  data  relative  to  a  single,  global,  supervised  learning 
model  and  to  current  clinical  practice. 

This  task  has  already  been  completed  and  resulted  in  a  publication  (see  #2  and  #3 
in Reportable Outcomes), as described in a previous year’s report. With regard to the
challenging  calcification  cases,  no  local  model  was  able  to  outperform  the  simple,  single 
global  model. 

Task  3.  Combine  the  local  models  to  form  a  global,  modular  model.  Evaluate 
the  performance  of  the  modular  model  on  the  evaluation  data  set  relative  to  a  single, 
global,  supervised  learning  model  and  to  current  clinical  practice. 

This  task  has  also  been  completed  and  published  (see  #2  and  #4  in  Reportable 
Outcomes),  as  described  in  a  previous  year’s  report.  The  combination  of  modular  models 
did  not  outperform  the  simpler,  single  global  model.  This  negative  result  was  attributed  in 
part to the weaker performance over the challenging calcification cases. This again
motivated  the  current  work. 

Task 4. For the most challenging subset of the data, the microcalcification
lesions, extract image-based features using a fully automated CAD scheme.

This  task  involved  two  major  efforts.  First,  whereas  all  previous  aims  focused  on 
radiologist-interpreted  findings,  here  we  extracted  features  from  the  digitized 
mammograms  using  a  computer-aided  detection  (CAD)  algorithm.  These  new  features 
were  incorporated  into  our  first  ensemble  models,  each  based  upon  a  different  subset  of 
features  such  as  radiologist-interpreted,  image  processing,  and  patient  history.  Initial 
results  were  presented  and  published,  as  described  in  last  year’s  report.  The  calcification 
data  set  consisted  of  1508  lesions,  811  benign  and  697  malignant. 

Calcification  detection  and  segmentation 

Our  algorithm  used  a  matched  difference-of-Gaussian  (DoG)  filter  to  detect 
microcalcifications.  The  DoG  filter  selected  circular  bright  areas,  which  detected 
calcifications  well  and  did  not  detect  bright  vessels  in  the  image,  as  had  been  the  case 
with  earlier  histogram-based  calcification  detection  methods. 
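
As a minimal sketch of the difference-of-Gaussian idea described above, the following
pure-Python example subtracts a wide Gaussian blur from a narrow one, so that small bright
spots stand out against a smooth background. The function names, the sigma values, and the
clamped border handling are illustrative assumptions; the report does not give the
parameters of the matched filter that was actually used.

```python
import math


def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur of a list-of-lists image (borders are clamped)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    rows, cols = len(image), len(image[0])

    def clamp(index, size):
        return min(max(index, 0), size - 1)

    # Blur along each row first, then along each column.
    horizontal = [[sum(kernel[k + radius] * image[r][clamp(c + k, cols)]
                       for k in range(-radius, radius + 1))
                   for c in range(cols)] for r in range(rows)]
    return [[sum(kernel[k + radius] * horizontal[clamp(r + k, rows)][c]
                 for k in range(-radius, radius + 1))
             for c in range(cols)] for r in range(rows)]


def dog_response(image, sigma_small=1.0, sigma_large=2.0):
    """Difference of Gaussians: narrow blur minus wide blur (assumed sigmas)."""
    narrow = blur(image, sigma_small)
    wide = blur(image, sigma_large)
    return [[n - w for n, w in zip(n_row, w_row)]
            for n_row, w_row in zip(narrow, wide)]
```

Thresholding the response map would then give candidate calcification locations; a
uniform region produces a response of approximately zero.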

A Gaussian mixture-model density estimation technique was used to segment the
outlines of the individual calcifications automatically. This technique assumed that the
background pixels were drawn from one Gaussian density and the calcification pixels
from another. An iteratively reweighted least squares technique was implemented to fit
the mixture densities. An optimal threshold was then chosen to separate the background
pixels from the calcification pixels.
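
The two-Gaussian assumption can be illustrated with a short sketch. Note that it swaps
in the expectation-maximization (EM) algorithm in place of the iteratively reweighted
least squares fit used in this work, and it takes the threshold to be the gray level at
which the two weighted densities cross, which is only one plausible reading of "optimal".
All function names and the toy pixel values are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, iterations=50):
    """EM fit of a two-component 1-D Gaussian mixture (component 1 is brighter)."""
    low, high = min(values), max(values)
    means = [low, high]
    stds = [max((high - low) / 4.0, 1e-6)] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each pixel value.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and standard deviation.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            stds[k] = max(math.sqrt(var), 1e-6)
            weights[k] = nk / len(values)
    return weights, means, stds


def crossing_threshold(weights, means, stds, steps=1000):
    """Scan from the background mean toward the calcification mean and return
    the first gray level where the calcification density is at least as large."""
    low, high = means
    for i in range(steps + 1):
        x = low + (high - low) * i / steps
        background = weights[0] * normal_pdf(x, means[0], stds[0])
        calcification = weights[1] * normal_pdf(x, means[1], stds[1])
        if calcification >= background:
            return x
    return high
```

Pixels brighter than the returned threshold would be labeled as calcification and the
rest as background.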

Morphological  features 

Once  the  calcifications  were  detected  and  properly  segmented,  the  algorithm  then 
extracted  morphological  features.  The  morphological  descriptors  of  the  individual 
calcifications  were  area,  mean  density  above  the  background  density,  eccentricity,  and 
the  number  of  calcifications  in  the  cluster.  These  features  have  been  shown  to  aid  in  CAD 
schemes for calcification clusters [3]. The cluster morphological features were summaries
of  the  individual  calcification  morphological  feature  values.  These  summary  statistics 
were  the  minimum,  maximum,  average,  and  standard  deviation.  This  resulted  in  a  total  of 
13  morphological  cluster  features. 
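
One reading of the feature count that is consistent with the stated total of 13 is four
summary statistics for each of three per-calcification descriptors (3 x 4 = 12), plus the
number of calcifications in the cluster. The sketch below follows that reading. The
dictionary keys, the function name, and the use of the population standard deviation are
assumptions made for illustration and are not taken from the report.

```python
import statistics

# Hypothetical names for the three per-calcification descriptors.
PER_CALCIFICATION_FEATURES = ("area", "mean_density_above_background", "eccentricity")


def cluster_morphology_features(calcifications):
    """Summarize a cluster given one dict of descriptors per calcification."""
    features = {"num_calcifications": len(calcifications)}
    for name in PER_CALCIFICATION_FEATURES:
        values = [calc[name] for calc in calcifications]
        features[name + "_min"] = min(values)
        features[name + "_max"] = max(values)
        features[name + "_mean"] = statistics.mean(values)
        # Population standard deviation, so a single calcification gives 0.
        features[name + "_std"] = statistics.pstdev(values)
    return features
```

For a cluster with any number of calcifications, the returned dictionary has 13 entries.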

Texture  features 

It  has  been  shown  that  by  erasing  the  calcifications  from  the  lesion  image, 
important  texture  information  can  be  extracted  from  the  background  anatomy  [3].  Our 
algorithm  characterized  the  texture  features  of  the  anatomical  background  of  the  lesion 
ROI, with the calcifications removed. Once the calcifications had been properly detected
and segmented, they were smoothly erased from the image by bilinear interpolation.
Figures  1-3  show  the  progression  of  a  sample  ROI  from  raw  data  to  calcification 
segmentation  to  calcification  erasure  for  texture  analysis. 

The spatial gray-level dependence (SGLD) matrix was used to calculate texture
features. The SGLD matrix is the joint probability of occurrence of gray levels for pixel
pairs that are separated by a particular distance at a particular angle [4]. The 13 SGLD,
or co-occurrence matrix, features are correlation, entropy, energy (angular second
moment), inertia, inverse difference moment, sum average, sum entropy, sum variance,
difference average, difference entropy, difference variance, information measure of
correlation 1, and information measure of correlation 2. The SGLD matrices were
calculated over the bounding boxes of the detected microcalcification clusters. A set of
13 SGLD features was calculated for each combination of distance and angle. We
considered all possible combinations of the angles {0, 45, 90, 135} degrees with the
distances {1, 5, 10, 15, 25} pixels. Overall, this yielded 13 x 4 x 5 = 260 texture features.
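
To make the SGLD construction concrete, here is a small sketch that builds the
co-occurrence probabilities for one distance and angle and computes two of the 13
features (energy and entropy). The symmetric pair counting, the row/column offset
convention for each angle, and the function names are assumptions; the report does not
specify these details.

```python
import math
from collections import Counter

ANGLES = (0, 45, 90, 135)          # degrees, as listed in the report
DISTANCES = (1, 5, 10, 15, 25)     # pixels, as listed in the report
FEATURES_PER_MATRIX = 13

# Assumed (row, column) unit offsets for each angle.
ANGLE_OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def sgld_matrix(image, distance, angle):
    """Joint probability of gray-level pairs at the given distance and angle."""
    d_row, d_col = ANGLE_OFFSETS[angle]
    d_row, d_col = d_row * distance, d_col * distance
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[(a, b)] += 1   # count each pair in both directions
                counts[(b, a)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


def energy(matrix):
    """Angular second moment: sum of squared probabilities."""
    return sum(p * p for p in matrix.values())


def entropy(matrix):
    return -sum(p * math.log(p) for p in matrix.values() if p > 0)


def total_texture_features():
    return FEATURES_PER_MATRIX * len(ANGLES) * len(DISTANCES)
```

With the angles and distances listed above, `total_texture_features()` reproduces the
count of 260.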


Fig. 1: ROI of calcification cluster
Fig. 2: Segmented calcifications
Fig. 3: ROI with calcifications erased


Task 5. Develop ensemble models for predicting benign vs. malignant
calcification clusters.

For ensemble modeling, the feature sets (morphology, texture, BI-RADS, and
patient age) were combined in various models in order to assess the relative predictive
power of each feature set. Linear discriminant analysis (LDA) and artificial neural
networks (ANNs) were used as models. The LDAs captured the linear trends in the data;
the contribution of nonlinear trends can be seen in the difference in classification
performance between the LDAs and the ANNs.

The large number of input features often hampered model training, so feature
selection was used. For the LDA, stepwise feature selection was performed using the
Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). By
penalizing the model size, the feature selection methods found an optimal compromise
between model goodness of fit and model complexity. Tables 1 and 2 show the areas
under the ROC curves for the top five LDAs and top five ANNs, sorted in order of
decreasing testing performance. There were no apparent trends in which combinations of
feature subsets were the best. The best LDA model used morphology, BI-RADS, and
patient age. The best ANN used only BI-RADS features and yielded the best performance
overall, with an ROC area of 0.754±0.038.
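
For readers unfamiliar with the two criteria, the sketch below shows the standard
definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L) and how each penalizes an extra
feature. The log-likelihood values in the usage note are made up for illustration, and the
helper names are hypothetical; only the case count of 1508 comes from the report.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike Information Criterion (lower is better)."""
    return 2.0 * num_params - 2.0 * log_likelihood


def bic(log_likelihood, num_params, num_cases):
    """Bayesian Information Criterion (lower is better)."""
    return num_params * math.log(num_cases) - 2.0 * log_likelihood


def prefers_larger_model(ll_small, k_small, ll_large, k_large, num_cases, criterion):
    """Return True if adding features lowers the chosen criterion."""
    if criterion == "aic":
        return aic(ll_large, k_large) < aic(ll_small, k_small)
    if criterion == "bic":
        return bic(ll_large, k_large, num_cases) < bic(ll_small, k_small, num_cases)
    raise ValueError("criterion must be 'aic' or 'bic'")
```

With 1508 cases, ln n is about 7.3, so BIC charges more per added parameter than AIC
does. For example, a hypothetical feature that raises the log-likelihood from -900 to
-898 would be kept under AIC but rejected under BIC.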


Table 1: The top 5 LDAs

Training    LOOCV Testing    Features
0.711       0.692±0.014      M, B, P
0.708       0.690±0.015      M, B
0.845       0.686±0.014      M, T, B, P
0.843       0.686±0.014      M, T, B
0.833       0.677±0.016      T, B, P

Table 2: The top 5 ANNs

Training    10-fold CV Testing    Features
0.778       0.754±0.038           B
0.751       0.734±0.032           M, B
0.762       0.733±0.022           M, B, P
0.717       0.705±0.038           B, P
0.901       0.704±0.022           T, B

The values listed are areas under the ROC curves.

Legend: B = BI-RADS, M = morphology, P = patient age, T = texture

The top ANN outperformed the top LDA (p < 0.001) over the entire range of
the  ROC  curves,  as  shown  in  Figure  4.  This  is  a  very  unusual  situation,  as  in  our 
experience  with  breast  CAD  systems,  rarely  do  ANNs  actually  outperform  the  simple  but 
robust  LDA  models. 




Fig.  4:  ROC  curves  for  the  top  LDA  and  top  ANN 

KEY  RESEARCH  ACCOMPLISHMENTS 

•  Developed  a  new  algorithm  for  microcalcification  detection  with  high  sensitivity 

•  Developed  a  very  accurate  calcification  border  segmentation  algorithm 

•  Developed  morphological  features  for  microcalcifications  and  microcalcification 
clusters 

•  Developed  texture  features  for  lesion  ROIs  with  the  calcifications  removed 

•  Developed  a  fully  automated  ensemble  CAD  system  to  detect  microcalcification 
clusters 

•  Developed  a  nonlinear  ANN  predictive  model  which  was  statistically 
significantly  better  than  the  widely  used  linear  discriminant  models 

CONCLUSIONS 

We developed ensemble systems of machine learning methods for computer-aided
diagnosis (CAD) of breast cancer to reduce the number of benign biopsies. We
focused in particular on microcalcification lesions, which are much more difficult to
classify than masses. Taking advantage of nonlinearities within our large dataset, the
ANNs were able to fit and classify the data better than the LDAs. The BI-RADS features
were the strongest in terms of classifier performance. Unfortunately, the texture features
did not contribute greatly to the classifiers, which could be due to the significant
noise introduced by the mammogram film-digitizing scanner.

This  project  has  built  the  framework  for  general  data  fusion  techniques  for  breast 
cancer  diagnosis  and  has  led  Mr.  Jesneck  to  expand  this  research  area  in  his  new  Army 
Breast  Cancer  Predoctoral  Fellowship:  A  Computer-Aided  Diagnosis  System  for  Breast 
Cancer  Combining  Mammography  and  Genomics. 
Abstract

Perceptrons are typically trained to minimize mean square error (MSE). In computer-aided diagnosis (CAD),
model performance is usually evaluated according to other more clinically relevant measures. The purpose of
this study was to investigate the relationship between MSE and the area ($A_z$) under the receiver operating
characteristic (ROC) curve and the high-sensitivity partial ROC area (${}_{0.90}A'_z$). A perceptron was used to
predict lesion malignancy based on two mammographic findings and patient age. For each performance measure,
the error surface in weight space was visualized. Comparison of the surfaces indicated that minimizing MSE
tended to maximize $A_z$, but not ${}_{0.90}A'_z$. © 2002 Elsevier Science Ltd. All rights reserved.


Keywords:  Computer-aided  diagnosis;  Perceptron;  Neural  network;  Breast  cancer;  Error  surface 


1.  Introduction 

While mammography is very sensitive at detecting breast cancer, its specificity is low. Only
15-34% of non-palpable, mammographically suspicious lesions are found to be malignant at biopsy
[1,2]. The excessive number of benign breast biopsies raises the overall cost of mammographic
screening to society [3] and results in emotional and physical burden to the patients. One goal of
the application of computer-aided diagnosis (CAD) to mammography is the reduction of this false
positive rate.

In recent years, many breast cancer CAD studies have focused on the use of artificial neural
network (ANN) models. ANN models have been developed to predict malignancy among suspicious
breast lesions based upon mammographic and history findings [4-8]. Most networks for CAD are
based on classic feed-forward, error-backpropagation paradigms, which are trained to minimize mean
squared error (MSE) using a gradient descent technique. In “weight space”, the ANN modifies
a vector of weights, descending a multi-dimensional error surface in search of the global
minimum in MSE. Once trained, however, these ANNs are often evaluated according to other more
clinically relevant measures of performance from receiver operating characteristic (ROC) analysis.
Such measures include the ROC area index ($A_z$) and the partial area index (${}_{0.90}A'_z$)
corresponding to the portion of the ROC curve in the high sensitivity range of 0.9-1.0 [9,10].
(More information on ${}_{0.90}A'_z$ is provided in the Methods section.)

The relationship between these three performance measures is not well defined, but there is a
generally unstated assumption that a classifier trained to optimize MSE will also tend to optimize
other measures such as $A_z$ and ${}_{0.90}A'_z$. The validity of that assumption was questioned in
recent studies. In one study, Kupinski et al. compared the performance of neural network models
trained in the conventional manner (i.e., minimize MSE) vs. those trained by a niched Pareto
multi-objective genetic algorithm (NP-GA) which simultaneously maximized sensitivity and specificity
[11]. Using simulated XOR (exclusive or) data, they found that the ROC curve generated by NP-GA
training was superior to that resulting from conventional training for both a perceptron (logistic
discriminant) and an artificial neural network. Kupinski et al. also compared the performance of a
conventionally trained perceptron to an NP-GA trained perceptron for the task of breast mass
detection [12]. They found that while there was no significant difference between the models in
terms of $A_z$, the NP-GA trained perceptron was significantly better in terms of ${}_{0.90}A'_z$.
In other words, the weights identified by minimizing the MSE were inferior to those identified by
the NP-GA in terms of the model’s performance at high sensitivities.

A related study demonstrated that different feature selection techniques might be preferred when
${}_{0.90}A'_z$ is considered instead of $A_z$. Sahiner et al. compared the performance of linear
discriminant analysis (LDA) classifiers using features selected by an LDA technique vs. a genetic
algorithm (GA) [13]. The former provided better $A_z$ but the latter had better ${}_{0.90}A'_z$.

All  of  the  above  studies  examined  the  behavior  of  either  linear  or  logistic  discriminants.  Although 
highly  simplified  compared  to  ANNs,  these  techniques  are  important  for  several  reasons.  First,  their 
simplicity  allows  easy  analysis  of  the  relatively  few  parameters.  For  example,  previous  work  at  this 
institution  presented  a  typical  ANN  for  breast  cancer  CAD  with  16  inputs  and  10  hidden  nodes, 
characterized  by  180  weight  parameters  [14].  In  comparison,  the  highly  simplified  perceptrons  in 
this  study  were  characterized  by  only  four  weights. 

Secondly, several authors have reviewed recent studies where ANNs were applied to CAD problems,
and suggested that a logistic model (such as a perceptron) would have likely provided similar
performance  while  avoiding  over-fitting  problems  [15,16].  Indeed,  many  recent  studies  in  the  field  of 
CAD  have  been  based  upon  linear  discriminant  models  [17-20].  Any  lessons  learned  from  optimizing 
perceptrons  would  thus  likely  be  useful  to  the  field  of  CAD  research. 

The simple architecture of perceptrons is crucial to this study, which investigates the underlying
behavior of these models by studying the error surfaces formed as a function of the parametric
weights. In particular, the goal is to compare error surfaces resulting from measuring performance
with MSE vs. $A_z$ and ${}_{0.90}A'_z$.


2.  Materials  and  methods 

2.1.  Data  set 

The  data  set  consisted  of  500  cases  of  non-palpable  breast  lesions  from  patients  who  had  undergone 
excisional  biopsy  at  Duke  University  Medical  Center  between  1991  and  1996.  In  other  words,  the 
data  set  consisted  of  a  consecutive  sample  of  actual  clinical  cases.  Of  these  500  lesions,  65%  were 
found  to  be  benign  as  a  result  of  histopathologic  diagnosis.  The  relatively  low  prevalence  of  disease 
in  this  data  set  is  consistent  with  the  literature  concerning  this  diagnostic  task  [1,2].  It  is  expected 
that  models  built  on  a  clinically  representative  case  mix  will  be  better  prepared  to  classify  previously 
unseen  clinical  cases.  The  method  of  encoding  the  lesion  descriptors  has  been  previously  described 
[14],  and  will  only  be  summarized  here.  Expert  radiologists  retrospectively  reviewed  the  patient 
films and recorded ten mammographic findings according to the Breast Imaging Reporting and Data
System  (BI-RADS™)  lexicon  [21],  as  well  as  other  patient  history  data  including  the  age.  These 
findings  were  encoded  into  numeric  values  and  used  as  input  features  in  order  to  predict  the  known 
biopsy  outcome  of  benign  vs.  malignant. 

2.2.  Network  architecture 

Even with the simplified architecture of a perceptron, it was still important to reduce the
dimensionality of the input features in order to permit visualization and analysis. The number of
inputs was therefore pruned to the three most important ones, based upon previous work in identifying
the most important input findings for this diagnostic problem [14,22]. The BI-RADS™ findings
used were mass margin and calcification morphology. In addition, a single patient history variable,
age, was used. All features were scaled to the range of 0-1. This 3-input perceptron is shown in
Fig. 1. The perceptron had one weight per input (W1, W2, and W3) and a bias term (W4). The dot
product of the input vector and the weight vector is passed through a non-linear activation function
to produce the output. The inputs were the two BI-RADS™ findings, calcification morphology (weight
W1) and mass margin (weight W2), and patient age (weight W3). The outputs of the perceptron
range from 0, which indicates a benign lesion, to 1, which indicates a malignant lesion. Perceptron
learning parameters were empirically optimized to minimize MSE: learning rate and momentum of
0.05 and 1000 iterations, with each iteration defined as a complete presentation of all training cases
with weight adjustment after each case.
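
The forward pass of such a 3-input perceptron can be sketched as follows. The logistic
sigmoid is an assumption (the text says only that the activation is non-linear, although
the Introduction equates a perceptron with a logistic discriminant), the function names are
hypothetical, and the default weights are the trained values reported in Section 2.3.

```python
import math

# Trained weights reported in Section 2.3: W1, W2, W3, and the bias W4.
TRAINED_WEIGHTS = (1.65, 2.22, 2.56)
TRAINED_BIAS = -3.21


def perceptron_output(features, weights=TRAINED_WEIGHTS, bias=TRAINED_BIAS):
    """Map (calcification morphology, mass margin, age), each scaled to 0-1,
    to an output between 0 (benign) and 1 (malignant)."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-activation))  # assumed logistic activation


def mean_squared_error(outputs, targets):
    """Average squared difference between outputs and 0/1 biopsy outcomes."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)
```

Under this assumption, a case with all three inputs at 0 gives an output near 0, and a
case with all three inputs at 1 gives an output near 1.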

2.3.  Error  surface  analysis 

In weight space, each weight defines a dimension. Each point in the four-dimensional weight space
represents a vector of weight values that define a distinct perceptron. When this perceptron is applied
to a data set of input cases, the resulting MSE or other measures of performance are functions of
the weights defining that perceptron. The error surface is the surface formed by evaluating the MSE
as a function of a range of weight values in each dimension. Other measures of performance, such
as $A_z$ and ${}_{0.90}A'_z$, can be used to form different surfaces in the same manner as the error
surface. For simplicity, we refer to all such surfaces of performance measures as “error surfaces”.
Notice that plotting the error surface is not an optimization technique, but instead is used to show
general trends in the data. For a perceptron with only two weights, the error surface may be readily
plotted in the “z” or third dimension. In the current study, however, two-dimensional slices of the
error surface are plotted instead of attempting to visualize the four-dimensional error surface. In a
slice, two of the weights are varied to produce the surface, while the other two weights are held
constant. Fig. 2 shows an example of an error surface slice. For simplicity, in the remainder of the
error surface plots, the performance function will be plotted as intensity as in Fig. 3A.

Fig. 1. Architecture of the perceptron. The dot product of the input vector (calcification morphology,
mass margin, age, and bias) and the weight vector (Weight 1, Weight 2, Weight 3, and Weight 4) is
passed through a non-linear activation function (f(·)) to produce the output (F).

To generate these slices, a grid search through weight space was performed. The perceptron with
each combination of weights was applied to the data set. The MSE, ROC area ($A_z$), or partial area
index (${}_{0.90}A'_z$) of each perceptron is indicated by intensity. Although the MSE and ROC have
been reported in many previous studies, the ${}_{0.90}A'_z$ is relatively less well studied.
${}_{0.90}A'_z$ can be interpreted as the mean specificity of the model over the given high sensitivity
range. It has particular clinical relevance in these examples of breast cancer CAD, where it is much
more important to optimize sensitivity in the uppermost portion of the ROC curve, rather than
specificity in the leftmost portion of the ROC curve. Note that while lower values for MSE indicate
better performance, higher values for the performance measures $A_z$ and ${}_{0.90}A'_z$ indicate
better performance.

Fig. 2. An MSE surface in weight space. The MSE is a function of the perceptron weights (W1, W2, W3,
and W4). W1 and W4 were held constant.

Fig. 3. The MSE surface in weight space. The MSE is a function of the perceptron weights (W1, W2, W3,
and W4). The MSE is shown as intensity. Darker gray indicates better performance. The slices through
the MSE surface are (A) W3 vs. W2, (B) W3 vs. W1, and (C) W1 vs. W2. The subplots are arranged such
that folding them into a box provides a way to visualize three of the weight dimensions.

The ${}_{TPF_0}A'_z$ was defined by Jiang et al. [10]. The partial area is the area under the ROC curve
from a given sensitivity ($TPF_0$) to 1.0, where $TPF_0 = 0.90$ is typically used. The partial area index
(${}_{TPF_0}A'_z$) is the partial area normalized by dividing by the constant $(1 - TPF_0)$. Note that
the optimal value of both $A_z$ and ${}_{0.90}A'_z$ is 1.0, but the chance behavior is 0.5 for $A_z$
while it is 0.05 for ${}_{0.90}A'_z$ at $TPF_0 = 0.90$. The ROC analysis was performed using software
modified and provided by Charles Metz, University of Chicago. The $A_z$ and ${}_{0.90}A'_z$ were
calculated using a modified version of the LABROC4 software, which finds a maximum likelihood
estimate of the area from a fit to the data. The statistical comparisons were calculated using a
modified version of the CLABROC software, which finds a maximum likelihood estimate of the areas for
two classifications from fits to the two data sets. An estimate of statistical significance is
reported for differences between the fitted curves. This estimate of significance includes the
contribution from correlation of the input data.
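
The chance value of 0.05 quoted for the partial area index can be checked with a short
derivation. The integral form below is our reading of the definition (the area between the
ROC curve and the line TPF = TPF_0, divided by its maximum possible value), which matches
the "mean specificity over the high sensitivity range" interpretation; it is a sketch, not a
quotation from Jiang et al.

```latex
% Partial area index as the mean specificity over TPF_0 <= TPF <= 1
\[
{}_{TPF_0}A'_z \;=\; \frac{1}{1 - TPF_0}
  \int_{TPF_0}^{1} \bigl[\, 1 - FPF(TPF) \,\bigr] \, d\,TPF
\]

% Chance classifier: the ROC curve is the diagonal, so FPF = TPF = t
\[
{}_{TPF_0}A'_z \;=\; \frac{1}{1 - TPF_0} \int_{TPF_0}^{1} (1 - t)\, dt
  \;=\; \frac{(1 - TPF_0)^2 / 2}{1 - TPF_0}
  \;=\; \frac{1 - TPF_0}{2}
  \;=\; 0.05 \quad \text{at } TPF_0 = 0.90
\]

% Perfect classifier: FPF = 0 over the whole range, so the index equals 1
```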
The grid search over the weights was done in the vicinity of weights identified as optimal by
training a perceptron to minimize the MSE of the data set. In other words, the training was used only
to narrow down the reasonable range of weights over which the grid search was performed. With
learning rate and momentum of 0.05 and 1000 iterations, the final weights were W1 = 1.65, W2 =
2.22, W3 = 2.56, and W4 = -3.21. In order to simplify the visualization further, the bias weight W4
was always fixed at that ‘central’ value. Each two-dimensional slice was generated by varying two
of the feature weights while the bias and one remaining feature weight were held constant at the
aforementioned ‘central’ values. The three combinations resulted in an “exploded box” showing the
three-dimensional relationship between the three weights W1, W2, and W3. Each weight was varied
approximately over the range of the central value ±150% of the central value. W1 was varied from
-1.00 to 5.00. W2 was varied from -2.00 to 5.95. W3 was varied from -3.00 to 6.90.
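
The slice construction just described can be sketched in a few lines: two weights are swept
over a grid while the remaining weights stay at their central values, and a performance
measure is evaluated at every grid point. The logistic activation, the grid resolution, the
function names, and the toy cases in the usage note are assumptions for illustration; the
study's actual grid spacing and software are not stated here.

```python
import math


def perceptron_output(features, weights):
    """Logistic output; weights = [W1, W2, W3, W4], with W4 the bias."""
    activation = sum(w * x for w, x in zip(weights[:3], features)) + weights[3]
    return 1.0 / (1.0 + math.exp(-activation))


def mse(weights, cases):
    """Mean squared error over (features, target) pairs with targets 0 or 1."""
    return sum((perceptron_output(x, weights) - t) ** 2 for x, t in cases) / len(cases)


def linspace(low, high, count):
    """Evenly spaced grid values from low to high inclusive (count >= 2)."""
    return [low + (high - low) * k / (count - 1) for k in range(count)]


def surface_slice(cases, central, index_a, range_a, index_b, range_b, count, metric=mse):
    """Evaluate `metric` on a count x count grid over two weights,
    holding every other weight at its central value."""
    surface = {}
    for wa in linspace(range_a[0], range_a[1], count):
        for wb in linspace(range_b[0], range_b[1], count):
            weights = list(central)
            weights[index_a] = wa
            weights[index_b] = wb
            surface[(wa, wb)] = metric(weights, cases)
    return surface
```

For example, calling `surface_slice(cases, [1.65, 2.22, 2.56, -3.21], 1, (-2.00, 5.95), 2,
(-3.00, 6.90), 50)` would produce a W2 vs. W3 slice with W1 and W4 held at their central
values; passing a different `metric` function would give the corresponding slice for
another performance measure.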


3.  Results 


3.1. MSE vs. $A_z$


Fig. 3 shows three two-dimensional slices through the MSE surface and Fig. 4 shows three
two-dimensional slices through the $A_z$ surface. Note that improved performance corresponds to
minimizing MSE (darker grayscale value) but maximizing $A_z$ (brighter grayscale value). MSE is
expected to range between 0 (perfect) and 0.5 (chance behavior), while $A_z$ ranges between 0.5
(chance) and 1 (perfect). While the MSE and $A_z$ surfaces are clearly not the same, the minimum
observed on the MSE surface is in the same general location in weight space as the maximum
observed on the $A_z$ surface. The best solution corresponding to the global minimum on the MSE
surface, i.e. the central weights (W1 = 1.65, W2 = 2.22, W3 = 2.56, and W4 = -3.21), has MSE
of 0.41 and $A_z$ of 0.80 ± 0.02. The best solution corresponding to the global maximum on the
$A_z$ surface (W1 = 1.65, W2 = 1.90, W3 = 2.40, W4 = -3.21, Fig. 4A) has MSE of 0.41 and $A_z$ of
0.80 ± 0.02. The difference in the $A_z$ between the solutions was not statistically significant
(two tail p = 0.14).

Fig. 4. The $A_z$ surface in weight space. The $A_z$ is a function of the perceptron weights (W1, W2,
W3, and W4). The $A_z$ is shown as intensity. Lighter gray indicates better performance. The slices
through the $A_z$ surface are (A) W3 vs. W2, (B) W3 vs. W1, and (C) W1 vs. W2.

Fig. 5. The ${}_{0.90}A'_z$ surface in weight space. The ${}_{0.90}A'_z$ is a function of the perceptron
weights (W1, W2, W3, and W4). The ${}_{0.90}A'_z$ is shown as intensity. Lighter gray indicates better
performance. The slices through the ${}_{0.90}A'_z$ surface are (A) W3 vs. W2, (B) W3 vs. W1, and
(C) W1 vs. W2.

3.2. MSE vs. ${}_{0.90}A'_z$

Fig. 3 shows three two-dimensional slices through the MSE surface and Fig. 5 shows three
two-dimensional slices through the ${}_{0.90}A'_z$ surface. There is less correspondence in the general
appearance of the contours between the MSE and ${}_{0.90}A'_z$ surfaces than was observed between MSE
and $A_z$ surfaces. The solution on the MSE surface, i.e. the central weights (W1 = 1.65, W2 = 2.22,
W3 = 2.56, and W4 = -3.21), does not correspond to the best solution corresponding to a global
maximum in the ${}_{0.90}A'_z$ surface (W1 = 3.35, W2 = 2.22, W3 = 5.70, and W4 = -3.21, Fig. 5B).
The solution on the MSE surface has MSE of 0.41 and ${}_{0.90}A'_z$ of 0.24 ± 0.05. The solution on
the ${}_{0.90}A'_z$ surface has MSE of 0.58 and ${}_{0.90}A'_z$ of 0.30 ± 0.04. The difference in
${}_{0.90}A'_z$ between the solutions was statistically significant (two tail p = 0.006).

This same trend may be demonstrated by comparing a particular operating point, such as the
specificity for 95% sensitivity. The best MSE solution resulted in a specificity of 25% while the best
specificity solution resulted in a specificity of 31%. This difference in specificity at 95% sensitivity
was again statistically significant (p = 0.002).
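
An operating point such as "specificity at 95% sensitivity" can be read directly off the
perceptron outputs. The sketch below is an empirical (non-parametric) version that picks the
highest threshold still calling at least 95% of the malignant cases positive. It is not the
fitted-curve procedure used for the values above, and the function name and tie handling
are assumptions.

```python
import math


def specificity_at_sensitivity(benign_outputs, malignant_outputs, target_sensitivity=0.95):
    """Empirical specificity at the highest threshold that reaches the target sensitivity.

    A case is called positive when its output is >= the threshold.
    """
    ranked = sorted(malignant_outputs, reverse=True)
    # Number of malignant cases that must be called positive (guarded against
    # floating-point round-off in the product).
    needed = max(1, math.ceil(target_sensitivity * len(ranked) - 1e-9))
    threshold = ranked[needed - 1]
    true_negatives = sum(1 for b in benign_outputs if b < threshold)
    return true_negatives / len(benign_outputs)
```

Shifting the output distributions changes where this threshold falls, which is why two
weight vectors with similar overall ROC area can differ at this operating point.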
Fig. 6. Histograms of the outputs of the perceptron for the weights that correspond to (A) the
minimal MSE and (B) the maximal ${}_{0.90}A'_z$. Panels (A) and (B) are each labeled
“Perceptron Output”.

The difference in the solutions on the MSE and ${}_{0.90}A'_z$ surfaces is illustrated by comparing
the histograms of the outputs of the corresponding perceptrons (Fig. 6). Since the ${}_{0.90}A'_z$
measure describes the high sensitivity region of the ROC curve, the outputs of the perceptron with
the highest ${}_{0.90}A'_z$ tend to be higher than the outputs of the perceptron with the lowest MSE.


4.  Discussion 

The three metrics of performance studied here are important for different reasons. The MSE is
the metric that many models, including perceptrons and ANNs, attempt to optimize directly, while
the Az and 0.90Az' have greater clinical significance. Consider the histograms (Fig. 6) of network
outputs of benign cases and malignant cases, where the network output of “0” indicates a benign
lesion and “1” indicates a malignant lesion. MSE is a measure of how close the distribution
of benign cases is to a network output of “0” and how close the distribution of malignant cases is
to “1”. The area under the ROC curve is a measure of the overlap of the distributions. A training
scheme that minimizes MSE, and so pulls the distributions to the edges, can also reduce the overlap
of the distributions, and so increase Az. It should be noted, however, that the MSE can decrease
without an accompanying change in Az, because each increment in Az can only result from the
reversal of position for an adjacent pair of benign and malignant cases in the histogram. While a
full convergence to MSE = 0 will also result in Az = 1, the latter can be achieved over a wide range
of MSE values, as long as the two distributions do not overlap at all. In the current study, it was
observed that the weights that minimized MSE also essentially maximized Az (the difference in Az
between the two solutions was not statistically significant).
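The distinction between MSE and Az drawn above can be made concrete with a small sketch. It uses the empirical, rank-based estimate of the ROC area as a stand-in for a fitted Az, and the three sets of outputs are hypothetical values chosen only to illustrate the argument.

```python
def mse(outputs, targets):
    """Mean squared error between network outputs and 0/1 targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


def empirical_az(outputs, targets):
    """Rank-based ROC area: the fraction of (malignant, benign) pairs in
    which the malignant case has the higher output (ties count one half)."""
    pos = [o for o, t in zip(outputs, targets) if t == 1]
    neg = [o for o, t in zip(outputs, targets) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


targets = [0, 0, 0, 1, 1, 1]
# Same rank order in both sets, so the ROC area is identical, but the
# second set sits closer to the 0/1 targets and has a lower MSE.
loose = [0.40, 0.45, 0.55, 0.50, 0.60, 0.65]
tight = [0.05, 0.10, 0.55, 0.50, 0.90, 0.95]
# No overlap between the classes, so the area is 1 although the MSE is
# far from 0.
separated = [0.40, 0.45, 0.48, 0.52, 0.60, 0.65]

for name, outs in [("loose", loose), ("tight", tight), ("separated", separated)]:
    print(name, round(mse(outs, targets), 4), round(empirical_az(outs, targets), 4))
```

Because the area depends only on the ordering of the outputs, any change in the weights that moves outputs without swapping a benign and malignant pair changes MSE but leaves the area untouched.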

It  should  be  noted  that  in  this  study  an  ROC  curve  was  generated  by  applying  a  threshold  to  the 
output  node  of  the  perceptron.  By  comparison,  the  method  of  Woods  and  Bowyer  [23]  scales  the 
bias  weight  for  the  nodes  in  the  hidden  layer  of  an  artificial  neural  network.  Since  perceptrons  lack 
a  hidden  layer,  their  method  would  not  be  appropriate  here. 

In recent years, the sensitivity of breast cancer CAD techniques has been particularly emphasized,
since there is a considerably greater cost in missing or delaying the diagnosis of an actual cancer
(false negative) compared to referring a benign lesion to an unnecessary biopsy (false positive).
For a range of sensitivities (e.g., TPF0 from 0.9 to 1), the TPF0Az' can be thought of as an average
specificity [10]. As an aid to interpreting these surfaces, it is helpful to note that for low values of
the threshold TPF0, the TPF0Az' surface resembles the Az surface. Conversely, as TPF0 increases, the
TPF0Az' surface resembles the specificity surface at a given high sensitivity level. Unlike MSE and
Az, 0.90Az' is not symmetric in the sense that false negative and false positive cases do not contribute
to the measure in the same way. In this work, the solution on the 0.90Az' surface was found not to
correspond well with the MSE solution. It should be noted that the differences in the weights that
optimize MSE vs. 0.90Az' may be due in part to biases inherent to the reduced amount of data that
is associated with the high sensitivity region of the ROC curve.
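The reading of the partial area index as an average specificity can be sketched as follows. This is a minimal illustration under stated assumptions: it integrates specificity over the sensitivity range from TPF0 to 1 on the empirical, piecewise-linear ROC curve and divides by (1 - TPF0); the study itself may have used a fitted curve, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Empirical ROC points (FPF, TPF), strictest threshold first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for thr in sorted(set(scores), reverse=True):
        for s, y in zip(scores, labels):
            if s == thr:
                if y == 1:
                    tp += 1
                else:
                    fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def partial_area_index(scores, labels, tpf0=0.9):
    """Average specificity (1 - FPF) over sensitivities in [tpf0, 1]."""
    pts = roc_points(scores, labels)
    area = 0.0
    for (f0, t0), (f1, t1) in zip(pts, pts[1:]):
        if t1 <= tpf0 or t1 == t0:
            continue  # segment lies below TPF0 or is horizontal
        lo = max(t0, tpf0)
        # FPF at the clipped lower end of the segment (linear interpolation).
        f_lo = f0 + (f1 - f0) * (lo - t0) / (t1 - t0)
        # Trapezoid of specificity against sensitivity.
        area += (t1 - lo) * (1.0 - 0.5 * (f_lo + f1))
    return area / (1.0 - tpf0)


# Perfectly separated outputs versus completely uninformative (tied) outputs.
print(partial_area_index([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
print(partial_area_index([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))
```

With this normalization a perfect classifier scores 1 and a chance-level (diagonal) ROC curve scores about 0.05 for TPF0 = 0.9, since the average of (1 - TPF) over the interval from 0.9 to 1 is 0.05.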

If  it  is  thought  that  Az  is  a  suitable  measure  of  performance  of  CAD  systems  for  breast  cancer, 
then  this  work  can  be  interpreted  as  a  reassurance  that  classifiers  trained  to  minimize  MSE  may 
also  maximize  the  measure  of  interest.  This  provides  some  justification  for  avoiding  the  task  of 
attempting  to  directly  optimize  model  performance  according  to  Az.  Note  that  optimizing  for  Az  by 
gradient  descent  techniques  is  not  straightforward  since  Az  is  not  a  continuous  function. 

However, if 0.90Az' corresponding to a given high level of sensitivity is a better measure of the
quality of CAD systems for breast cancer, then this work demonstrates that a classifier trained to
minimize MSE may provide an inferior solution. Alternative methods of identifying good weights for
a perceptron or multi-layer network should be considered, such as evolutionary computing techniques
that employ stochastic optimization. Our conclusions are consistent with related previous work that
compared optimization techniques. As described in the introduction, Kupinski et al. found that using
a perceptron (logistic discriminant) trained by a genetic algorithm instead of a classically trained
perceptron resulted in no significant change in Az, but a significant improvement in 0.90Az' [12].


5.  Summary 

Perceptrons,  like  more  complicated  backpropagation  artificial  neural  networks,  are  typically  trained 
to  minimize  mean  square  error  (MSE).  In  computer-aided  diagnosis  (CAD)  applications,  model 
performance  is  usually  evaluated  according  to  other  more  clinically  relevant  measures  from  receiver 
operating  characteristic  (ROC)  analysis.  The  purpose  of  this  study  was  to  investigate  the  relationship 
between MSE and the area (Az) under the ROC curve and the partial ROC area (0.90Az') under the
high sensitivity portion of the ROC curve. A perceptron was used to predict whether or not breast
lesions were malignant based on two mammographic findings and patient age. For each performance
measure, the error surface in weight space was visualized. Comparison of the surfaces indicated that
minimizing MSE tended to maximize Az, but not 0.90Az'. If it is important to maximize 0.90Az', then
predictive models trained to minimize MSE may provide inferior solutions.

M.K. Markey et al. / Computers in Biology and Medicine 32 (2002) 99-109

Differences between Computer-aided Diagnosis of Breast Masses and That of Calcifications


PURPOSE:  To  compare  the  performance  of  a  computer-aided  diagnosis  (CAD) 
system  for  diagnosis  of  previously  detected  lesions,  based  on  radiologist-extracted 
findings  on  masses  and  calcifications. 

MATERIALS AND METHODS: A feed-forward, back-propagation artificial neural
network (BP-ANN) was trained in a round-robin (leave-one-out) manner to predict
biopsy outcome from mammographic findings (according to the Breast Imaging
Reporting and Data System) and patient age. The BP-ANN was trained by using a
large (>1,000 cases) heterogeneous data set containing masses and microcalcifications.
The performances of the BP-ANN on masses and microcalcifications were
compared with use of receiver operating characteristic analysis and a z test for
uncorrelated samples.

RESULTS: The BP-ANN performed significantly better on masses than on microcalcifications
in terms of both the area under the receiver operating characteristic curve
and the partial receiver operating characteristic area index. A similar difference in
performance was observed with a second model (linear discriminant analysis) and
also with a second data set from a similar institution.

CONCLUSION:  Masses  and  calcifications  should  be  considered  separately  when 
evaluating  CAD  systems  for  breast  cancer  diagnosis. 
© RSNA, 2002


Among American women, breast cancer is the most common cancer and is the second
leading cause of cancer deaths (1). Women in the United States have about a 1 in 8 lifetime
risk of developing invasive breast cancer (2,3). Mammographic screening has been shown
to reduce the mortality of breast cancer by as much as 30% (4,5). However, mammography
has a low positive predictive value (PPV). Approximately 35% or less of women who
undergo biopsy for histopathologic diagnosis of breast cancer are found to have
malignancies (6). One goal of the application of computer-aided diagnosis (CAD) to
mammography is to reduce the false-positive rate. Avoiding benign biopsies spares women
unnecessary discomfort, anxiety, and expense.

CAD of breast cancer is the application of computational techniques to the problem of
interpreting breast images, usually mammograms (7-9). There are two major topics in
breast cancer CAD: detection of mammographic lesions and diagnosis of cancer from
identified lesions. In the detection task, the goal is to assist a radiologist in the
identification, and often the localization, of lesion-containing regions of mammograms. In the
diagnosis task, the goal is to assist a radiologist in determining whether an identified breast
lesion is an indication of cancer. This study focused on the diagnosis of breast lesions that
had already been identified by radiologists as suspicious enough to warrant biopsy. In
other words, these cases are generally considered indeterminate and more challenging,
and any reduction in the number of benign biopsies represents an improvement over the
status quo, provided high sensitivity is maintained.

Most breast biopsies are performed on lesions that manifest mammographically as either a
mass or a cluster of microcalcifications (10). CAD systems for detection generally perform
better on calcifications than on masses, as shown in two review articles (8,11) and a recent
study from a commercial CAD vendor (12). CAD systems for diagnosis that are based on
features automatically extracted from the images are typically designed for either masses or
calcifications alone. We are unaware of any previous attempts to compare the performance on
masses and calcifications within a single study. Given the differences in databases and
techniques with CAD systems for diagnosis, direct comparison of the published performances
on masses and calcifications is not possible. However, the authors of classification studies
on masses (13,14) report performances that are better than those reported in studies on
calcifications (15,16). CAD systems for diagnosis that are based on findings extracted by
radiologists are often trained and evaluated over heterogeneous data sets including both
masses and calcifications, and the performances on masses and calcifications are not
reported separately (17-20). The purpose of our study was to compare the performance of a
CAD system for diagnosis of already detected lesions, based on radiologist-extracted
findings on masses and calcifications.

MATERIALS  AND  METHODS 
Data 

Original studies were performed in accordance with standard clinical indications. All data
from human subjects were collected with approval from appropriate institutional review
boards, which also waived the requirement for informed patient consent.

We collected data on 1,530 nonpalpable mammographically suspicious breast lesions on which
biopsy (core or excisional) was performed from 1990 to 2000 at Duke University Medical
Center. The data were collected over several discontinuous time periods, but were collected
consecutively within each time period. Of the 1,530 cases, 61 were removed because it was
not certain that they were nonpalpable. In addition, 16 cases were removed because the
radiologist's assessment of the likelihood of malignancy was unavailable. Thus, the primary
data consisted of 1,453 approximately consecutive, nonpalpable, mammographically suspicious
breast lesions. Experienced mammographers summarized each case according to the Breast
Imaging Reporting and Data System (BI-RADS) lexicon (21). Each of the cases was read by one
of seven readers. The 475 cases collected from 1990 to 1996 were read retrospectively, and
the 978 cases collected from 1996 to 2000 were read prospectively.


Of the 1,453 cases, 508 (35%) were found to be malignant at biopsy. For the purposes of this
study, a case was considered a "mass case" if mass features were present and no values were
missing for any of the mass or calcification features. Likewise, a case was considered a
"calcification case" if calcification features were present, but no mass features were
present, and no values were missing for any of the mass or calcification features. There
were 615 cases with masses, including 65 cases with calcifications in addition to a mass.
There were 622 cases with calcifications that did not have masses as well. The PPVs for the
mass cases (223/615 = 36%) and the calcification cases (209/622 = 34%) were similar (P = .65,
χ2 test for independence; 95% CI for the difference in malignancy fraction = -0.027, 0.080).
The remaining 216 cases consisted of cases with neither a mass nor calcifications (n = 132)
and cases with incomplete descriptions of the mass or calcifications that were present
(n = 84). A mass was considered incompletely described if there were missing values for some
of the mass or calcification features. Likewise, a calcification was considered incompletely
described if there were missing values for some of the calcification features. The cases
without a mass or calcifications were described by other findings, such as architectural
distortion. When the value was missing for a feature, it was encoded in the same manner as
if the finding was not present. All 1,453 cases, including the 216 cases with neither a mass
nor calcifications, were used in building the CAD models for diagnosis.
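The confidence interval quoted for the mass and calcification PPVs can be checked with a short calculation. The sketch below assumes the interval is a Wald (normal-approximation) interval for the difference between the two malignancy fractions; the text does not state which method was used, and the function name is hypothetical.

```python
import math


def difference_ci(pos1, n1, pos2, n2, z=1.96):
    """Wald confidence interval for the difference of two proportions.

    z = 1.96 corresponds to a two-sided 95% interval under the normal
    approximation (an assumption of this sketch).
    """
    p1, p2 = pos1 / n1, pos2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Counts taken from the text: 223 of 615 masses and 209 of 622
# calcifications were malignant.
low, high = difference_ci(223, 615, 209, 622)
print(round(low, 3), round(high, 3))
```

Under that assumption the interval comes out at roughly -0.027 to 0.080, in line with the values given in the text, and it includes zero.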

A second data set consisted of 1,000 consecutive mammographically suspicious breast lesions
on which excisional biopsy was performed from 1990 to 1997 at the University of Pennsylvania
Medical Center. Experienced mammographers summarized each case according to the BI-RADS
lexicon (21). Each of the cases was read retrospectively by one of 11 readers. Of the 1,000
cases, 396 (40%) were found to be malignant at biopsy. There were 481 cases with masses,
including 10 cases with calcifications in addition to a mass. There were 449 cases with
calcifications that did not also have masses. The PPV observed for the masses
(191/481 = 40%) was the same as that for the calcifications (178/449 = 40%). There were 70
other cases, most (n = 68) of which were cases with incompletely described masses or
calcifications. All 1,000 cases, including the incompletely described ones, were used in
training the CAD models for diagnosis.


Specifically, the BI-RADS features collected were mass margin, mass shape, mass density,
mass size, calcification morphology, calcification distribution, and associated and special
findings. Although not a part of the BI-RADS specification, the number of calcifications is
routinely collected at both institutions and was also included. The number of calcifications
was indicated as no calcifications present, fewer than five, five to 10, or more than 10
calcifications present. The location of the lesion was also included and was encoded as
posterior, central, axillary tail, subareolar, lower inner quadrant, lower outer quadrant,
upper inner quadrant, or upper outer quadrant.

In addition to the BI-RADS findings, patient age was collected. For the cases from Duke
University Medical Center, the mean age was 56 years, with a range of 23-87 years. For the
cases from the University of Pennsylvania Medical Center, the mean age was 55 years, with a
range of 17-92 years. Age is known to be an important risk factor for breast cancer.
Increasing age is associated with increasing risk of breast cancer; a 60-year-old white
American woman has a 14-fold increase in her chances of developing breast cancer relative to
a 30-year-old white American woman (5). In agreement with the epidemiologic data, some
evidence exists that age is a particularly valuable input in our predictive models (22).

For the cases from Duke University Medical Center, the mammographers indicated on a scale of
1-5 their assessment of the likelihood of malignancy. These assessment data were not
available for the cases collected at the University of Pennsylvania Medical Center. An
assessment of 1 indicated benign findings; 2, likely benign findings; 3, indeterminate
findings; 4, likely malignant findings; and 5, malignant findings. The mammographer's
assessment of malignancy was collected at the same time as the BI-RADS descriptors. As
mentioned, some of the cases were read retrospectively and some were read prospectively, and
although several mammographers participated in the study, each case was read by a single
mammographer. Notice that this assessment is not the same as the BI-RADS clinical
assessment. Moreover, this assessment does not directly correspond to the clinical task of
deciding whether a patient should be referred to biopsy or follow-up. Since all the cases in
the data set were subjected to biopsy, the mammographers were by definition performing with
100% relative sensitivity and 0% relative specificity on this data set (PPV,
508/1,453 = 35%). (Notice that these relative measures are not indicative of the
radiologists' performances over a general screening or diagnostic mammography patient
population in which most actually benign cases are correctly referred to follow-up.)
Nevertheless, their assessment of the likelihood of malignancy is useful as an approximation
to an internal intermediate state in the decision process.

Figure 1. ROC curves for the mammographers' assessment of the likelihood of malignancy in
the cases from Duke University Medical Center. The mammographers' assessment was more
accurate for masses than for calcifications. FPF = false-positive fraction, TPF =
true-positive fraction.

Figure 2. ROC curves for the BP-ANN in the cases from Duke University Medical Center. BP-ANN
was more accurate for masses than for calcifications. FPF = false-positive fraction, TPF =
true-positive fraction.

Artificial  Neural  Network 

A feed-forward back-propagation artificial neural network (BP-ANN) can learn a function
mapping inputs to outputs by being trained with cases of input-output pairs (23-25). The
network inputs were the BI-RADS features and patient age. The network had a single hidden
layer and one output node indicating malignancy. Each neuron in the network used a logistic
activation function, y = 1/(1 + e^(-x)). The BP-ANN was trained to minimize the
sum-of-squares error by using the back-propagation algorithm (23-25). A binary variable
indicating benign or malignant was used as the network target. The target values were
clipped to 0.1 and 0.9 to ensure that the network weights remained finite (sigmoid units
cannot produce 0 or 1). The network weights were updated after the presentation of each case
(stochastic gradient descent), which can help alleviate the problem of local minima. A
momentum term was used, which can also help the network escape local minima. The training
cases were presented to the network in a round-robin (leave-one-out) manner. To avoid
overtraining, network training ended when the average testing error on the left-out cases
began to increase (early stopping). The network parameters (learning rate, momentum, and
number of hidden nodes in the single hidden layer) were empirically optimized. The custom
neural network software used was written by members of our laboratory and has been used in
several previous publications (22).
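The training scheme described above can be sketched in a few lines. This is a minimal illustration, not the laboratory's custom software: the hyperparameter values, the toy two-feature cases, and all function names are hypothetical, a fixed number of epochs replaces the early-stopping rule, and the real inputs were BI-RADS features and age.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_bp_ann(cases, n_hidden=3, learning_rate=0.5, momentum=0.5,
                 epochs=300, seed=0):
    """Train a one-hidden-layer network and return a predict(x) function."""
    rng = random.Random(seed)
    n_in = len(cases[0][0])
    # Each weight row ends with a bias weight.
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    v_hid = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    v_out = [0.0] * (n_hidden + 1)
    order = list(range(len(cases)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:  # weights change after every case (stochastic)
            x, label = cases[i]
            target = 0.9 if label == 1 else 0.1  # clipped targets
            xb = list(x) + [1.0]
            h = [logistic(sum(w * a for w, a in zip(row, xb))) for row in w_hid]
            hb = h + [1.0]
            y = logistic(sum(w * a for w, a in zip(w_out, hb)))
            # Gradients of the squared error 0.5 * (y - target)^2.
            d_out = (y - target) * y * (1.0 - y)
            d_hid = [d_out * w_out[j] * h[j] * (1.0 - h[j])
                     for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                v_out[j] = momentum * v_out[j] - learning_rate * d_out * hb[j]
                w_out[j] += v_out[j]
            for j in range(n_hidden):
                for k in range(n_in + 1):
                    v_hid[j][k] = (momentum * v_hid[j][k]
                                   - learning_rate * d_hid[j] * xb[k])
                    w_hid[j][k] += v_hid[j][k]

    def predict(x):
        xb = list(x) + [1.0]
        hb = [logistic(sum(w * a for w, a in zip(row, xb)))
              for row in w_hid] + [1.0]
        return logistic(sum(w * a for w, a in zip(w_out, hb)))

    return predict


def round_robin_outputs(cases, **kwargs):
    """Leave-one-out: each case is scored by a network trained on the rest."""
    outputs = []
    for i, (x, _) in enumerate(cases):
        predict = train_bp_ann(cases[:i] + cases[i + 1:], **kwargs)
        outputs.append(predict(x))
    return outputs


# Hypothetical cases: (features scaled to [0, 1], 1 = malignant, 0 = benign).
cases = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((0.1, 0.3), 0),
         ((0.9, 0.8), 1), ((1.0, 0.7), 1), ((0.8, 1.0), 1)]
print([round(o, 2) for o in round_robin_outputs(cases)])
```

The round-robin outputs, one per case, are what an ROC analysis would then be computed on.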

Linear  Discriminant  Analysis 

Linear discriminant analysis (LDA) was performed on the data collected at Duke University
Medical Center. LDA is a common statistical technique for linear classification. The same
input findings were used, and the cases were used in a round-robin fashion as with the
BP-ANN. The LDA was computed by using the implementation in SAS software (SAS Institute,
Cary, NC).

Receiver  Operating  Characteristic 

The models were evaluated in terms of their receiver operating characteristic (ROC) curves.
ROC curves enable the user to evaluate a model in terms of the trade-offs between
sensitivity and specificity (26,27). The performance of classification methods can be
evaluated by directly comparing their ROC curves or by comparing indices calculated from
their curves. The most commonly used index is the area under the ROC curve (Az). Notice that
the values for Az range from 0.5 for chance to 1.0 for a perfect classifier.

Figure 3. ROC curves for the LDA in the cases from Duke University Medical Center. LDA was
more accurate for masses than for calcifications. FPF = false-positive fraction, TPF =
true-positive fraction.

Figure 4. ROC curves for the BP-ANN in the cases from the University of Pennsylvania Medical
Center. BP-ANN was more accurate for masses than for calcifications. FPF = false-positive
fraction, TPF = true-positive fraction.

In breast cancer diagnosis, the decision task is whether to refer a suspicious case to
biopsy or recommend follow-up imaging. A true-positive finding would be an actual cancer
that was correctly referred to biopsy. A true-negative finding would be an actual benign
lesion that was correctly recommended for follow-up imaging. The cost of missing a cancer
(false-negative finding) far outweighs that of an unnecessary benign biopsy (false-positive
finding). As a result, we were most concerned about the high sensitivity region of the
curve, so we also used the partial area index (0.90Az') calculated on that portion of the
curve (true-positive fraction, 0.9-1.0) (28,29). The partial area index is the partial area
normalized such that it ranges from 0.05 for chance to 1.0 for a perfect classifier. ROC
analysis was performed by using software modified and provided by Charles Metz at the
University of Chicago. The modified LABROC4 software (maximum likelihood, semiparametric
fit) was used to calculate the ROC curves and the curve indices, Az and 0.90Az'. Statistical
comparisons were made with use of a standard z test since there was no correlation between
the mass and calcification cases. A P value of less than .01 was considered to indicate a
statistically significant difference.
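The z test for uncorrelated samples mentioned above reduces to a short calculation. The sketch below assumes that the ± values reported with each index are standard errors and that the two estimates are independent; the function name is hypothetical, and the example numbers are the BP-ANN areas reported in the Results for the Duke cases.

```python
import math


def z_test_uncorrelated(index1, se1, index2, se2):
    """Two-tailed z test for the difference of two independent estimates."""
    z = (index1 - index2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # Two-tailed P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Masses: Az = 0.93 (0.01); calcifications: Az = 0.63 (0.02).
z, p = z_test_uncorrelated(0.93, 0.01, 0.63, 0.02)
print(round(z, 2), p < 0.01)
```

Under these assumptions the difference of 0.30 is more than 13 combined standard errors, well beyond the P < .01 criterion.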

RESULTS

Duke  University  Medical  Center 

Mammographers' assessment. The mammographers' assessment of the likelihood of malignancy
(five-point scale) was used as a decision variable, and ROC curves were formed for masses
and calcifications separately (Fig 1). There was a significant difference (P < .01) in the
ROC areas for the masses (Az = 0.94 ± 0.01) compared with that for the calcifications
(Az = 0.74 ± 0.02). There was also a significant difference (P < .01) in the partial area
index for the masses (0.90Az' = 0.62 ± 0.06) versus that for the calcifications
(0.90Az' = 0.17 ± 0.04). The ROC curve over all of the cases was intermediate
(Az = 0.85 ± 0.01, 0.90Az' = 0.34 ± 0.04). The assessment of the mammographers was more
accurate for the masses than for the calcifications. Notice, however, that the actual
clinical performance of the mammographers was essentially the same for masses
(PPV = 223/615 = 36%) and calcifications (PPV = 209/622 = 34%, P = .65, χ2 test for
independence; 95% CI for the difference in malignancy fraction = -0.027, 0.080). Notice as
well that since each case was read by a single mammographer and the study included seven
readers, the assessment was pooled across mammographers.

BP-ANN performance. The BP-ANN developed by using round-robin sampling on all of the cases
from Duke University Medical Center also performed better on the masses than the
calcifications (Fig 2). The difference in the ROC area for the masses (Az = 0.93 ± 0.01) and
that for the calcifications (Az = 0.63 ± 0.02) was significant (P < .01). The difference in
the partial area index was also significant (P < .01) between the masses
(0.90Az' = 0.62 ± 0.05) and the calcifications (0.90Az' = 0.10 ± 0.02). The ROC curve over
all of the cases was intermediate (Az = 0.82 ± 0.01, 0.90Az' = 0.30 ± 0.03).

Linear discriminant analysis. The round-robin LDA classifier on the cases from Duke
University Medical Center also performed better on the masses than on the calcifications
(Fig 3). There was a significant difference (P < .01) in the ROC area for the masses
(Az = 0.91 ± 0.01) versus that for the calcifications (Az = 0.62 ± 0.02). The difference in
the partial area index between the masses (0.90Az' = 0.61 ± 0.04) and that for the
calcifications (0.90Az' = 0.11 ± 0.02) was also significant (P < .01). The ROC curve over
all of the cases was intermediate (Az = 0.80 ± 0.01, 0.90Az' = 0.28 ± 0.03).

University  of  Pennsylvania  Medical 
Center:  BP-ANN 

The BP-ANN developed by using round-robin sampling on the cases from the University of
Pennsylvania Medical Center also performed better on the masses than on the calcifications
(Fig 4). There was a significant difference (P < .01) in the ROC area of the masses
(Az = 0.88 ± 0.02) compared with that for the calcifications (Az = 0.76 ± 0.02). There was
also a significant difference (P < .01) in the partial area index of the masses
(0.90Az' = 0.45 ± 0.05) versus the calcifications (0.90Az' = 0.23 ± 0.04). The ROC curve
over all of the cases was intermediate (Az = 0.82 ± 0.01, 0.90Az' = 0.34 ± 0.03).

DISCUSSION 


In this study, the performances of a breast cancer CAD model on mass and microcalcification
lesions were compared. BP-ANN and LDA models were considered. BP-ANN analysis was repeated
with data from a second similar institution. The mammographers' assessment of malignancy was
also investigated. The performance on masses was consistently better than the performance on
calcifications in comparisons involving radiologists, CAD models, and data from two
institutions.

A BP-ANN trained in a round-robin fashion on a heterogeneous set of biopsy-proved breast
lesions was found to perform significantly better on masses than on calcifications in terms
of the ROC area and the partial area index. This difference was seen with use of two data
sets collected at different institutions, which argues that this phenomenon is not a
function of a particular data set. A similar difference in performance on masses and
calcifications was seen when another predictive model, LDA, was used. Moreover, in a
separate study conducted at Duke University Medical Center, a similar difference in
performance was observed with a constraint satisfaction neural network (30). This indicates
that the observed performance differential is not specific to BP-ANN models. However, it is
possible that if some other classification technique were used, such differences would not
be observed between masses and calcifications. Finally, when the mammographers' assessment
of the likelihood of malignancy was used as a decision variable, it was found that they too
seemed to be able to more accurately assess the masses than the calcifications. Notice,
however, that there is no corresponding difference in their clinical recommendations, based
on the PPV of biopsy for those two subsets of cases. Taken together, these findings suggest
that masses and calcifications should be considered separately when evaluating CAD systems
for breast cancer diagnosis. It should be recalled that the "masses" in this study included
both calcified and noncalcified masses and that the presence of calcifications in addition
to a primary mass lesion may affect the classification of that mass by either a
computational technique or a mammographer.

Recent work by Huo et al (14,31) describes a CAD system for diagnosis of breast masses that
handles spiculated and nonspiculated masses separately and is superior to a CAD system that
was developed on a mixture of spiculated and nonspiculated masses. The work described herein
can be interpreted as further evidence of the effect of distinct subsets on the performance
of the breast cancer CAD models for diagnosis. As larger databases become available for
developing CAD models for diagnosis, it may be beneficial to develop modular systems with
submodels that are specialized for subsets of the data. Alternatively, when a single CAD
model for diagnosis is developed over a heterogeneous data set, such as one containing both
mass and calcification cases, these results suggest that it would be appropriate to evaluate
the performance of the overall model over the subsets of interest.

Acknowledgments: The authors thank the members of the breast imaging sections
at Duke University Medical Center and the University of Pennsylvania Medical
Center. We also acknowledge Brian Harrawood, MS, for scientific programming.

References 

1. Ries LAG, Wingo PA, Miller DS, et al. The annual report to the nation on the status of cancer, 1973-1997, with a special section on colorectal cancer. Cancer 2000; 88:2398-2424.

2. Feuer EJ, Wun L, Boring CC, Flanders WD, Timmel MJ, Tong T. The lifetime risk of developing breast cancer. J Natl Cancer Inst 1993; 85:892-897.

3. Wun L, Merrill RM, Feuer EJ. Estimating lifetime and age-conditional probabilities of developing cancer. Lifetime Data Anal 1998; 4:169-186.

4. Shapiro S. Screening: assessment of current studies. Cancer 1994; 74:231-238.

5. Henderson IC. Breast cancer. In: Murphy GP, Lawrence W Jr, Lenhard RE, eds. American Cancer Society textbook of clinical oncology. Atlanta, Ga: American Cancer Society, 1995; 198-219.

6. Kopans DB. The positive predictive value of mammography. AJR Am J Roentgenol 1992; 158:521-526.

7. Doi K, MacMahon H, Katsuragawa S, Nishikawa RM, Jiang Y. Computer-aided diagnosis in radiology: potential and pitfalls. Eur J Radiol 1999; 31:97-109.

8. Vyborny CJ, Giger ML, Nishikawa RM. Computer-aided detection and diagnosis of breast cancer. Radiol Clin North Am 2000; 38:725-740.

9. Giger ML. Computer-aided diagnosis of breast lesions in medical images. Comput Sci Eng 2000; 2:39-45.

10. Liberman L, Abramson AF, Squires FB, Glassman JR, Morris EA, Dershaw DD. The Breast Imaging Reporting and Data System: positive predictive value of mammographic features and final assessment categories. AJR Am J Roentgenol 1998; 171:35-40.

11. Karssemeijer N, Hendriks JH. Computer-assisted reading of mammograms. Eur Radiol 1997; 7:743-748.

12. Castellino RA, Roehrig J, Zhang W. Improved computer-aided detection (CAD) algorithms for screening mammography (abstr). Radiology 2000; 217(P):400.

13. Chan HP, Sahiner B, Helvie MA, et al. Improvement of radiologists’ characterization of mammographic masses by using computer-aided diagnosis: an ROC study. Radiology 1999; 212:817-827.

14. Huo Z, Giger ML, Vyborny CJ, Wolverton DE, Schmidt RA, Doi K. Automated computerized classification of malignant and benign masses on digitized mammograms. Acad Radiol 1998; 5:155-168.

15. Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K. Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 1999; 6:22-33.

16. Chan HP, Sahiner B, Lam KL, et al. Computerized analysis of mammographic microcalcifications in morphological and texture feature spaces. Med Phys 1998; 25:2007-2019.

17. Wu Y, Giger ML, Doi K, Vyborny CJ, Schmidt RA, Metz CE. Artificial neural networks in mammography: application to decision making in the diagnosis of breast cancer. Radiology 1993; 187:81-87.

18. Baker JA, Kornguth PJ, Lo JY, Williford ME, Floyd CE Jr. Breast cancer: prediction with artificial neural network based on BI-RADS standardized lexicon. Radiology 1995; 196:817-822.

19. Kahn CE Jr, Roberts LM, Shaffer KA, Haddawy P. Construction of a Bayesian network for mammographic diagnosis of breast cancer. Comput Biol Med 1997; 27:19-29.

20. Floyd CE Jr, Lo JY, Tourassi GD. Case-based reasoning computer algorithm that uses mammographic findings for breast biopsy decisions. AJR Am J Roentgenol 2000; 175:1347-1352.

21. American College of Radiology. BI-RADS: American College of Radiology Breast Imaging Reporting and Data System (BI-RADS). 3rd ed. Reston, Va: American College of Radiology, 1998.

22. Lo JY, Baker JA, Kornguth PJ, Floyd CE Jr. Effect of patient history data on the prediction of breast cancer from mammographic findings with artificial neural networks. Acad Radiol 1999; 6:10-15.

23. Rumelhart DE, McClelland JL, ed. Parallel distributed processing: explorations in the microstructures of cognition. Cambridge, Mass: MIT Press, 1986.

24. Bishop CM. Neural networks for pattern recognition. Oxford, England: Oxford University Press, 1995.

25. Hertz J, Anders K, Palmer RG. Introduction to the theory of computation: Santa Fe Institute Studies in the Science of Complexity. Redwood City, Calif: Addison-Wesley, 1991.

26. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978; 8:283-298.

27. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986; 21:720-733.

28. McClish DK. Analyzing a portion of the ROC curve. Med Decis Making 1989; 9:190-195.

29. Jiang Y, Metz CE, Nishikawa RM. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology 1996; 201:745-750.

30. Tourassi GD, Markey MK, Lo JY, Floyd CE Jr. A neural network approach to breast cancer diagnosis as a constraint satisfaction problem. Med Phys 2001; 28:804-811.

31. Huo Z, Giger ML, Metz CE. Effect of dominant features on neural network performance in the classification of mammographic lesions. Phys Med Biol 1999; 44:2579-2595.


Volume  223  -  Number  2 


Computer-aided  Diagnosis  of  Breast  Masses  and  Calcifications  -  493 


APPENDIX  3 


ELSEVIER 


Artificial Intelligence in Medicine 27 (2003) 113-127

www.elsevier.com/locate/artmed 


Self-organizing  map  for  cluster  analysis  of 
a  breast  cancer  database 

Mia K. Markey a,b,*, Joseph Y. Lo a,b, Georgia D. Tourassi b, Carey E. Floyd Jr. a,b

a Department of Biomedical Engineering, Duke University, Durham, NC 27708, USA
b Digital Imaging Research Division, Department of Radiology, Duke University Medical Center, Durham, NC 27710, USA

Received  10  May  2002;  received  in  revised  form  1  November  2002;  accepted  10  December  2002 


Abstract 

The  purpose  of  this  study  was  to  identify  and  characterize  clusters  in  a  heterogeneous  breast  cancer 
computer-aided  diagnosis  database.  Identification  of  subgroups  within  the  database  could  help 
elucidate  clinical  trends  and  facilitate  future  model  building.  A  self-organizing  map  (SOM)  was  used 
to  identify  clusters  in  a  large  (2258  cases),  heterogeneous  computer-aided  diagnosis  database  based 
on  mammographic  findings  (BI-RADS™)  and  patient  age.  The  resulting  clusters  were  then 
characterized  by  their  prototypes  determined  using  a  constraint  satisfaction  neural  network  (CSNN). 
The  clusters  showed  logical  separation  of  clinical  subtypes  such  as  architectural  distortions,  masses, 
and  calcifications.  Moreover,  the  broad  categories  of  masses  and  calcifications  were  stratified  into 
several  clusters  (seven  for  masses  and  three  for  calcifications).  The  percent  of  the  cases  that  were 
malignant  was  notably  different  among  the  clusters  (ranging  from  6  to  83%).  A  feed-forward  back- 
propagation  artificial  neural  network  (BP-ANN)  was  used  to  identify  likely  benign  lesions  that  may 
be  candidates  for  follow  up  rather  than  biopsy.  The  performance  of  the  BP-ANN  varied  considerably 
across  the  clusters  identified  by  the  SOM.  In  particular,  a  cluster  (#6)  of  mass  cases  (6%  malignant) 
was  identified  that  accounted  for  79%  of  the  recommendations  for  follow  up  that  would  have  been 
made  by  the  BP-ANN.  A  classification  rule  based  on  the  profile  of  cluster  #6  performed  comparably 
to  the  BP-ANN,  providing  approximately  25%  specificity  at  98%  sensitivity.  This  performance  was 
demonstrated to generalize to a large (2177) set of cases held out for model validation.

1.  Introduction

There  is  considerable  interest  in  the  use  of  computational  techniques  to  aid  in  the 
detection  and  diagnosis  of  breast  cancer  [5,8,26].  Most  computer-aided  diagnosis  (CAD) 
studies,  including  this  one,  focus  on  mammography  since  it  is  the  primary  tool  for  the 
detection  of  breast  lesions  and  the  subsequent  decision  to  biopsy  suspicious  lesions.  The 
decision  to  biopsy  is  complicated  by  the  fact  that  breast  cancer  can  present  itself  in  a  variety 
of  ways  on  a  mammogram  and  there  is  considerable  overlap  in  the  appearance  of  benign 
and  malignant  lesions.  CAD  systems  for  the  decision  to  biopsy  that  are  based  on  findings 
extracted  by  radiologists  are  often  trained  and  evaluated  over  heterogeneous  databases  that 
reflect  this  variability  in  the  morphological  appearance  of  suspicious  breast  lesions 
[1,7,28].  We  have  recently  shown  that  a  CAD  tool  trained  on  such  a  heterogeneous 
database  can  perform  very  differently  on  two  broad  subgroups  which  constitute  most  of  the 
currently  biopsied  lesions:  masses  and  microcalcifications  [17].  In  particular,  we  observed 
that  the  performance  was  significantly  better  on  masses  than  on  calcifications. 

In  this  study,  we  used  a  self-organizing  map  (SOM)  [13]  to  identify  clusters  in  a 
heterogeneous  breast  cancer  CAD  database.  SOM  is  an  unsupervised  learning  method  that 
relates  similar  input  vectors  to  the  same  region  of  a  map  of  neurons.  To  the  best  of  our 
knowledge,  SOMs  have  not  been  used  to  identify  clusters  in  a  CAD  database  similar  to  the 
one  presented  here.  SOMs  have  been  used  for  other  tasks  in  breast  cancer  CAD  such  as  a 
benchmark for model selection [27] and to predict biopsy outcome [4].

Once  the  SOM  was  used  to  identify  the  clusters,  a  constraint-satisfaction  neural  network 
(CSNN)  was  used  to  characterize  the  clusters  by  determining  a  profile  for  each  cluster. 
Briefly,  the  CSNN  is  a  Hopfield-type  network  of  neurons  arranged  in  a  non-hierarchical 
way (Fig. 1). There are symmetric, bi-directional weights between all pairs of neurons but
there  are  no  reflexive  weights.  The  CSNN  operates  as  a  nonlinear,  dynamic  system  that 
tries  to  reach  a  globally  stable  state  by  adjusting  the  activation  levels  of  the  neurons  under 
the  constraints  imposed  by  the  a  priori  fixed  weight  values.  A  cluster  “profile”  provides  a 
description  of  a  “typical”  case  in  the  cluster.  We  have  previously  introduced  CSNN  for 
predicting  biopsy  outcome  and  as  a  data  mining  tool  for  breast  cancer  CAD  databases  [25]. 

A  feed-forward  back-propagation  artificial  neural  network  (BP-ANN)  is  a  classic 
technique  that  is  commonly  used  in  breast  cancer  CAD  systems.  Consequently,  a  BP- 
ANN  was  used  to  predict  the  biopsy  outcome  [2,10,21]  and  the  performance  of  the  BP- 
ANN  was  compared  on  the  clusters  identified  by  the  SOM  and  profiled  by  the  CSNN. 

A  clustering  algorithm  such  as  an  SOM  followed  by  a  cluster  characterization  method 
such  as  CSNN  profiling  could  serve  as  tools  in  the  initial  phases  of  a  divide-and-conquer 
approach  to  the  computer-aided  diagnosis  of  breast  cancer.  Both  modular  and  ensemble 
methods  could  be  used  for  a  divide-and-conquer  approach.  A  modular  system  uses  multiple 
classifiers  to  solve  a  classification  problem  by  partitioning  the  input  space  into  smaller 
domains, each of which is handled by a local model [24]. The local models can be thought
of as experts for a particular kind of case. Ensemble methods are resampling schemes in
which the same cases are used in training multiple experts, whose predictions are then
combined [24]. Such approaches may be justified in light of recent results in this field.
Simple  ensembles  of  classifiers  using  voting  or  averaging  to  combine  their  predictions  have 
shown promise in computer-aided detection of breast masses [14,22,31]. Zheng et al.
employed a modular scheme, in which the data were partitioned by a difficulty measure, for
computer-aided detection of breast masses with encouraging results [30]. Zheng et al. also
investigated a promising ensemble of modular models, formed by taking the average of the
predictions from modular models in which the data were partitioned using three features
[29]. Huo and coworkers described a modular system, in which the data were partitioned by
a spiculation measure, which was superior to a general image-based computer-aided
diagnosis system [11,12]. Finally, we have recently demonstrated that a BI-RADS™-based
CAD tool built on a heterogeneous database can perform very differently on two
broad subgroups of lesions, masses and microcalcifications [17]; the CAD tools
investigated performed better on masses than on calcifications. In all of the examples listed
here, a priori knowledge was used to partition the data into subsets. Unsupervised learning
may provide an alternate avenue to a priori knowledge for identifying subsets in the data that
should be handled separately in the development or evaluation of computer-aided
diagnosis or detection systems.

Fig. 1. Schematic of the constraint satisfaction neural network (CSNN). Notice that the neurons are fully
interconnected with no reflexive weights.


2.  Materials  and  methods 

2.1.  Data 

Approximately half of the 4435 available cases (2258) were used for model development
in this study; the remaining 2177 cases were withheld for additional model validation.
The data were randomly partitioned into the training and validation sets, but
attention was paid to key summary statistics such as the fraction of cases that were
malignant in each set. For each lesion, the benign or malignant status from pathologic
diagnosis was known. The overall malignancy fraction was 43%. In the next few
paragraphs, we describe the 2258 cases used for model development in greater detail.

The first data set consisted of 751 non-palpable, mammographically suspicious breast
lesions that underwent biopsy (core or excisional) at Duke University Medical Center from
1990 to 2000. The data collection procedures have been previously described [16]. Briefly,
expert mammographers described each case using the Breast Imaging Reporting and Data
System (BI-RADS™) lexicon [20]. Each of the cases was read by one of seven readers.
When a lesion could be described by multiple descriptors (e.g. pleomorphic and punctate),
the mammographers were requested to report the descriptor that was most suspicious for
malignancy (e.g. pleomorphic). Of the 751 cases, 260 (35%) were malignant.

The  second  data  set  consisted  of  501  mammographically  suspicious  breast  lesions  that 
underwent  excisional  biopsy  at  the  University  of  Pennsylvania  Medical  Center  from  1990 
to  1997.  The  data  collection  procedures  have  been  previously  described  [16].  Briefly,  each 
of the cases was read by one of 11 expert mammographers who described each case using
the  BI-RADS™  lexicon  [20].  When  a  lesion  could  be  described  by  multiple  descriptors 
(e.g.  pleomorphic  and  punctate),  the  mammographers  were  requested  to  report  the 
descriptor  that  was  most  suspicious  for  malignancy  (e.g.  pleomorphic).  Of  the  501  cases, 
200  (40%)  were  malignant. 

The  third  data  set  consisted  of  1006  biopsy-proven  breast  lesions  randomly  selected 
from  the  Digital  Database  for  Screening  Mammography  [9].  Expert  mammographers 
described  each  case  using  the  BI-RADS™  lexicon  [20].  Lesions  that  were  described  by 
multiple  descriptors  were  encoded  for  our  purposes  using  the  descriptor  that  was  most 
suspicious  for  malignancy.  Of  the  1006  cases,  522  (52%)  were  malignant. 

Specifically, the six BI-RADS™ features collected describe the mass margin, mass
shape, calcification morphology, calcification distribution, associated findings, and special
findings. Missing values were encoded as zero. Each BI-RADS™ feature was encoded using
uniformly scaled rank ordered categories (Table 1). For example, when a mass is present
for a case, the mass margin can take on one of five values: well circumscribed (1),
microlobulated (2), obscured (3), ill-defined (4), or spiculated (5). In addition to the
BI-RADS™ features, the patient age was collected, for a total of seven features.

2.2.  Self-organizing  map 

A  self-organizing  map  relates  similar  cases  (input  vectors)  to  the  same  region  of  a  map  of 
neurons [13]. The SOM was computed using the SOM toolbox in MATLAB® (The
MathWorks Inc., Natick, MA). The basic SOM consisted of 16 neurons arranged in a single
layer in a 2-D square grid of 4 x 4 neurons, but different configurations were considered.
For  each  case,  the  Euclidean  distance  between  the  case  and  each  neuron  was  calculated 
based  on  the  seven  input  features  (the  biopsy  outcome  was  not  provided  to  the  SOM).  For 
input  to  the  SOM,  each  feature  was  scaled  by  subtracting  the  mean  and  dividing  by  the 
standard  deviation,  resulting  in  each  scaled  feature  having  mean  zero  and  standard 
deviation of one. After the most similar neuron was determined, the neurons in its
neighborhood  were  identified.  The  neighborhood  of  a  neuron  was  defined  as  all  the 
neurons  within  a  given  link  distance  of  the  matched  neuron.  All  the  neurons  in  the 
neighborhood  were  adjusted  to  have  feature  values  closer  to  the  current  case.  The  amount 
that  the  neuron  weights  were  adjusted  was  controlled  by  the  learning  rate.  The  learning 
rates  and  distance  threshold  values  used  were  the  default  values  for  the  SOM  toolbox. 
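
The training procedure described above can be sketched in a few lines. The following is a minimal illustration only, not the SOM toolbox used in this study: the function names, the initialization from randomly chosen cases, the linearly decaying learning rate, the fixed neighborhood radius, and the use of Manhattan steps as the link distance on a square grid are all assumptions made for the example.

```python
import math
import random

def zscore_columns(rows):
    """Scale each feature to mean zero and standard deviation one."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Guard against constant features (standard deviation of zero).
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def best_matching_unit(weights, case):
    """Index of the neuron with the smallest Euclidean distance to the case."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], case))

def train_som(cases, side=4, epochs=20, rate=0.5, radius=1, seed=0):
    """Train a side x side map; all hyperparameter values are assumptions."""
    rng = random.Random(seed)
    weights = [list(rng.choice(cases)) for _ in range(side * side)]
    grid = [(j // side, j % side) for j in range(side * side)]
    for epoch in range(epochs):
        lr = rate * (1 - epoch / epochs)  # assumed linear decay
        for case in rng.sample(cases, len(cases)):
            win_row, win_col = grid[best_matching_unit(weights, case)]
            for j, (row, col) in enumerate(grid):
                # Neighborhood: neurons within `radius` grid links of the winner.
                if abs(row - win_row) + abs(col - win_col) <= radius:
                    weights[j] = [w + lr * (x - w)
                                  for w, x in zip(weights[j], case)]
    return weights
```

After training, calling `best_matching_unit` on each scaled case gives the neuron, and hence the cluster, to which that case is mapped.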

2.3.  Constraint  satisfaction  neural  network 

After  the  clusters  were  identified,  a  CSNN  was  used  to  determine  the  profiles  of  the 
clusters [23,25]. Custom software in the C language was used to implement the CSNN and
has been previously described [25]. The Lyapunov energy function was used as a measure
of the network stability. It was found that 1000 iterations were sufficient to achieve
stability. The weights were predetermined using autoassociative backpropagation neural
networks (auto-BP). In keeping with our previous work [25], the auto-BP networks were
trained with a learning rate of 1.0 for 100 iterations and the root mean squared training error
was approximately 0.1 (network outputs between 0 and 1).

For  each  cluster,  a  CSNN  was  used  to  generate  a  profile.  Each  category  of  the  categorical 
BI-RADS™  features  corresponded  to  a  binary  variable  and  associated  neuron.  For 
example,  the  mass  margin  with  its  five  non-zero  categories  was  represented  by  five 
separate neurons. Patient age was translated into a discrete variable with five levels (<40
years, 40 ≤ x < 50, 50 ≤ x < 60, 60 ≤ x < 70, ≥70 years) [25]. An additional neuron
was  used  to  signify  cluster  membership.  The  activation  level  of  the  neuron  indicating 
cluster  membership  was  set  to  the  maximal  value  and  the  other  neurons  were  allowed  to 
evolve  until  the  network  reached  a  stable  state.  The  feature  neurons  that  were  activated 
defined  the  profile  of  the  cluster.  A  profile  is  a  list  of  feature  values  that  succinctly 
summarizes the cluster and defines a “typical” case (e.g. mass margin is well
circumscribed, mass shape is round, and patient age is between 50 and 60 years). Not all cases
in the cluster exactly match the profile; there is still a distribution of feature values. Notice
that  unlike  common  summary  statistics,  such  as  the  cluster  centroid,  the  CSNN  profile 
implicitly  includes  feature  selection;  only  features  deemed  relevant  to  the  network  for 
describing  a  cluster  are  included. 
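
The profiling step can be illustrated with a simplified Hopfield-type network. This is a sketch under stated assumptions, not the CSNN software of [25]: it uses binary activations and an asynchronous threshold update, whereas the actual update rule is not given in this section, and the weights are supplied by hand rather than derived from auto-BP networks. The names `energy` and `settle` are hypothetical.

```python
def energy(weights, act):
    """Lyapunov energy E = -1/2 * sum over i != j of w_ij * a_i * a_j."""
    n = len(act)
    return -0.5 * sum(weights[i][j] * act[i] * act[j]
                      for i in range(n) for j in range(n) if i != j)

def settle(weights, clamped, max_sweeps=1000):
    """Evolve the unclamped neurons until no activation changes.

    weights: symmetric matrix; the diagonal is ignored (no reflexive weights).
    clamped: dict mapping neuron index to a fixed activation, e.g. the
             cluster-membership neuron held at its maximal value.
    """
    n = len(weights)
    act = [clamped.get(i, 0.0) for i in range(n)]
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            if i in clamped:
                continue
            net = sum(weights[i][j] * act[j] for j in range(n) if j != i)
            new = 1.0 if net > 0 else 0.0  # assumed binary threshold update
            if new != act[i]:
                act[i] = new
                changed = True
        if not changed:  # stable state reached
            break
    return act
```

With symmetric weights and this update rule, each change leaves the energy unchanged or lowers it, so the network settles. The feature neurons left active in the returned state play the role of the cluster profile.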

2.4.  Back-propagation  artificial  neural  network  (BP-ANN) 

A  feed-forward  back-propagation  artificial  neural  network  (BP-ANN)  was  used  to 
predict  the  biopsy  outcome  from  the  mammographic  findings  and  patient  age.  The  BP- 
ANN  was  trained  to  minimize  the  sum-of-squares  error  using  the  back-propagation 
algorithm  [2,10,21].  The  network  had  a  single  hidden  layer  of  14  neurons  and  each 
neuron  in  the  network  used  a  logistic  activation  function.  The  network  inputs  (7)  were  the 
BI-RADS™  features  and  patient  age.  Network  inputs  were  rescaled  from  0  to  1  (by 
subtracting  the  minimum  value  and  dividing  by  the  maximum  minus  the  minimum).  The 
biopsy outcomes were the network targets; there was one output node indicating
malignancy. The 2258 cases were presented to the network in a round-robin manner (leave-one-
out, i.e. k-fold cross-validation with k = N) and training ended before the average testing error
on the left-out cases began to increase. The custom neural network software used was
written in C++ by members of our laboratory, and the training and testing process has been
reported previously [15,17].
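
The input scaling, network shape, and round-robin splitting described above can be sketched as follows. This is an illustration only and not the laboratory's C++ software: the weight initialization range and all function names are assumptions, and the back-propagation weight updates and early stopping are deliberately omitted.

```python
import math
import random

def minmax_scale(rows):
    """Rescale each input to [0, 1]: (x - min) / (max - min)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(r, lo, hi)] for r in rows]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_in=7, n_hidden=14, seed=0):
    """Random weights for a 7-14-1 network; the last entry of each row is a bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output

def forward(net, x):
    """Forward pass with logistic activations; returns a value in (0, 1)."""
    hidden, output = net
    h = [logistic(w[-1] + sum(wi * xi for wi, xi in zip(w, x))) for w in hidden]
    return logistic(output[-1] + sum(wi * hi for wi, hi in zip(output, h)))

def leave_one_out(cases):
    """Yield (training cases, left-out case) for every case in turn."""
    for i in range(len(cases)):
        yield cases[:i] + cases[i + 1:], cases[i]
```

In a full implementation, each pass of `leave_one_out` would train the network on the N - 1 remaining cases and record the output of `forward` for the left-out case.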

2.5.  Receiver  operating  characteristic 

Receiver  operating  characteristic  (ROC)  curves  can  be  used  to  show  the  trade-off  in 
sensitivity  and  specificity  achievable  by  a  classifier  by  varying  the  threshold  on  the  output 
decision  variable  [18,19].  The  area  under  the  ROC  curve  is  often  used  as  a  measure  of 
classifier performance. In evaluating models for diagnosing breast cancer, not all sensitivities
are of equal interest. Only techniques that perform with very high sensitivity would be
clinically acceptable since missing a cancer (false negative) is generally considered much
worse than an unnecessary benign biopsy (false positive). Thus, particular attention was
paid to the specificity at 98% sensitivity.

The  ROC  curves  were  calculated  non-parametrically.  P-values  and  standard  deviations 
on  the  specificity  at  98%  sensitivity  were  estimated  by  bootstrap  sampling  on  the  decision 
variable [6].
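
As an illustration of the operating point emphasized here, the sketch below computes the specificity at a target sensitivity directly from the ranked decision variable and estimates its standard deviation by bootstrap resampling. It is a minimal example under assumptions: the function names, the number of bootstrap replicates, and the convention that a case is called malignant when its score is at or above the threshold are not taken from the study.

```python
import math
import random

def specificity_at_sensitivity(scores, labels, target=0.98):
    """Highest empirical specificity whose sensitivity is at least `target`.

    scores: decision variable, higher meaning more suspicious.
    labels: 1 for malignant, 0 for benign.
    """
    pos = sorted(s for s, y in zip(scores, labels) if y == 1)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    # Number of cancers that may be missed while keeping the target sensitivity.
    allowed_misses = int(math.floor((1 - target) * len(pos) + 1e-12))
    threshold = pos[allowed_misses]
    # Benign cases scoring below the threshold are spared biopsy.
    return sum(s < threshold for s in neg) / len(neg)

def bootstrap_sd(scores, labels, target=0.98, n_boot=200, seed=0):
    """Standard deviation of the specificity over bootstrap resamples."""
    specificity_at_sensitivity(scores, labels, target)  # validates the input
    rng = random.Random(seed)
    idx = range(len(scores))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # skip resamples containing a single class
            xs = [scores[i] for i in sample]
            stats.append(specificity_at_sensitivity(xs, ys, target))
    mean = sum(stats) / n_boot
    return math.sqrt(sum((s - mean) ** 2 for s in stats) / (n_boot - 1))
```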


3.  Results 

Fig.  2  illustrates  the  arrangement  of  the  neurons  in  the  SOM.  The  set  of  cases  that  were 
mapped  to  a  neuron  defined  a  cluster.  Fig.  2  shows  the  number  of  cases  that  were  mapped  to 
each  neuron,  i.e.  the  number  of  cases  in  each  cluster.  The  fraction  of  the  cases  in  each 
cluster  that  were  malignant  is  also  shown  in  Fig.  2  (bottom  number  in  italics).  The 
malignancy  fraction  is  not  shown  for  the  clusters  with  fewer  than  10  cases  (#5,  12,  and  15), 
on  the  assumption  that  no  meaningful  conclusions  can  be  drawn  from  such  a  small  number 
of  cases.  Inspection  of  the  cases  mapped  to  these  clusters  (#5, 12,  and  15)  revealed  that  the 
cases  are  rare  for  this  database.  They  included  cases  with  findings  that  were  seen  with  a 
very  low  prevalence  in  the  set  (e.g.  special  finding  of  intramammary  lymph  node)  or 
reflected  incomplete  or  inconsistent  data  (e.g.  the  calcification  morphology  was  described 
but  calcification  distribution  feature  was  not  reported).  Together  these  three  clusters 
comprise  only  0.5%  of  the  cases.  Therefore,  no  further  analysis  was  performed  on  these 
clusters.  Recall  that  the  SOM  was  not  provided  with  the  biopsy  outcome  information.  The 
differences  in  the  malignancy  fraction  are  a  reflection  of  differences  in  the  BI-RADS™ 
features  and  patient  age  between  the  clusters.  Cluster  malignancy  rates  near  50%  do 
contain some information since the overall malignancy fraction was 43%. Notice that there
is generally a higher incidence of malignant lesions in the clusters on the right-hand side of
the map.

Fig. 2. Index of the neurons in the 4 x 4 map. Each neuron defined a cluster. The number of cases that were
mapped to each neuron, i.e. the number of cases in each cluster (normal type), and the fraction of the cases in
each cluster that were malignant (italics) are shown. Malignancy fraction data not shown for the clusters with very
few cases. Overall, 43% of the cases were malignant. Information regarding the main features of the cases in
each cluster is shown in Figs. 4 and 5.

Fig. 3. (a) The index of the neurons in the 3 x 3 map; (b) Comparison of the clusters identified by the 3 x 3 and
4 x 4 SOMs. For each case, the neuron it mapped to was determined for each SOM. The number of cases for
each pair of clusters between the two SOMs was plotted; the size of the circle indicates the number of cases. The
more large bubbles that are present in such a plot, the more the SOMs agreed on the clustering of the cases.
Linear trends (i.e. bubbles lining up along the diagonals) indicate that the same cases are being mapped to the
same region in the two SOMs; (c) The index of the neurons in the 5 x 5 map; (d) Comparison of the clusters
identified by the 5 x 5 and 4 x 4 SOMs. For each case, the neuron it mapped to was determined for each SOM.
The number of cases for each pair of clusters between the two SOMs was plotted; the size of the circle indicates
the number of cases. The more large bubbles that are present in such a plot, the more the SOMs agreed on the
clustering of the cases. Linear trends (i.e. bubbles lining up along the diagonals) indicate that the same cases are
being mapped to the same region in the two SOMs.

Fig.  3  shows  the  effect  that  changing  the  SOM  architecture  has  on  the  clusters  identified. 
Alternative  architectures  allow  one  to  vary  the  number  of  neurons  as  well  as  their 
topological  layout,  thus  potentially  allowing  for  variations  in  the  complexity  of  the  model. 
One alternative to a 4 x 4 SOM is a smaller but still square 3 x 3 SOM (Fig. 3a). In Fig. 3b,
the clusters of the 3 x 3 and 4 x 4 SOMs are compared using a bubble plot. For each case,
the neuron it mapped to was determined for each SOM. The number of cases for each pair
of clusters between the two SOMs was plotted; the size of the circle indicates the number of
cases. The more large bubbles that are present in such a plot, the more the SOMs agreed on
the clustering of the cases. Similarly, Fig. 3c and 3d show the comparison with a 5 x 5
SOM. Linear trends (i.e. bubbles lining up along the diagonals) indicate that the same cases
are being mapped to the same region (e.g. upper right-hand area) in the two SOMs. In
addition to square topologies, other layouts were also investigated which utilized
approximately the same number of neurons. Comparisons were made to a 2 x 8 SOM and to a
three-dimensional SOM of 2 x 3 x 3 neurons, both with approximately the same number
of neurons as the 4 x 4 square SOM.

Fig. 4. The cluster profiles generated by the CSNN for the clusters identified by the 4 x 4 SOM (cluster number
in upper right corner). A cluster “profile” provides a description of a “typical” case in the cluster. Profiles were
not computed for neurons #5, 12, and 15 which had very few cases mapped to them. The percent of the cases
that were malignant is shown in the lower right-hand corner; refer to Fig. 2.
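
The counts behind such a bubble plot amount to a cross-tabulation of the two cluster assignments. A minimal sketch, assuming each SOM's assignments are available as a list of neuron indices in the same case order (the function name is hypothetical):

```python
from collections import Counter

def cluster_crosstab(assign_a, assign_b):
    """Count the cases for each pair (neuron in SOM A, neuron in SOM B)."""
    if len(assign_a) != len(assign_b):
        raise ValueError("both SOMs must assign the same set of cases")
    return Counter(zip(assign_a, assign_b))
```

Each key of the returned counter is one bubble and its count sets the bubble size; a few large counts indicate that the two maps group the cases in much the same way.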

For  the  4  x  4  SOM,  the  cluster  profiles  generated  by  the  CSNN  are  shown  in  Fig.  4.  Each 
cell  in  the  table  represents  the  feature  categories  that  were  dominant  or  most  strongly 
associated  with  the  cases  matching  that  cluster.  Profiles  were  not  computed  for  the  clusters 
with very few cases. The mass cases are distributed over neurons #2, 3, 4, 6, 7, and 8. The
profiles of neurons #9, 13, 14, and 16 indicate that those clusters contain
microcalcifications. Neuron #1's profile indicates that that cluster is comprised of focal asymmetric
densities. Note that the profile for neuron #10 includes only the age variable. The profile for
neuron #11 reveals that the lesions in that cluster are architectural distortions.

An  alternative  approach  to  generating  cluster  profiles  is  to  compute  summary  statistics 
such  as  the  feature  mode  (or  mean  for  real-valued  features  such  as  age).  Fig.  5  shows  the 
mode  profiles  of  the  clusters  identified  by  the  4  x  4  SOM.  For  the  most  part,  there  is 
considerable  agreement  between  the  CSNN  and  mode  profiles.  Most  of  the  differences 
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Fig.  5.  The  cluster  profiles  generated  by  the  computing  the  mode  the  features  (mean  for  age)  for  the  clusters 
identified  by  the  4  x  4  SOM  (cluster  number  in  upper  right  comer).  A  cluster  “profile”  provides  a  description  of 
a  “typical”  case  in  the  cluster.  Profiles  were  not  computed  for  neurons  #5,  12,  and  15  which  had  very  few  cases 
mapped  to  them.  Features  for  which  the  mode  value  indicated  that  the  feature  was  absent  were  omitted  (e.g. 
mass  margin  =  no  mass).  The  percent  of  the  cases  that  were  malignant  is  shown  in  the  lower  right-hand  corner; 
refer  to  Fig.  2. 


correspond to adjacent categories in the features (Table 1) where the CSNN has selected the
second most prevalent value for the profile. However, using multiple methods to summarize
the clusters may be beneficial. For example, the CSNN profile of neuron #16 (Fig. 4)
does  not  include  any  mass  features  yet  the  feature  mode  profile  (Fig.  5)  shows  that  the  mass 
features  are  usually  non-zero.  In  fact,  inspection  of  the  cases  in  the  cluster  defined  by 
neuron #16 reveals that they are calcified masses. Conversely, the CSNN profile for neuron
#10  (Fig.  4)  includes  only  the  age  variable  while  the  mode  profile’s  (Fig.  5)  inclusion  of 
values  for  the  calcification  variables  may  be  misleading  for  this  small  cluster  (N  =  29) 
where  there  is  little  dominance  by  any  single  value. 
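
The mode-based profiling described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the study: the feature names, the dictionary representation of a case, and the convention that a code of 0 means "feature absent" are assumptions made for the example.

```python
from collections import Counter
from statistics import mean

def mode_profile(cases, age_key="age"):
    """Summarize one cluster: mode of each categorical feature, mean of age.

    cases: list of dicts mapping feature name -> numeric code (hypothetical layout).
    Features whose mode is 0 (assumed to mean "absent") are left out of the profile.
    """
    profile = {}
    for feature in cases[0]:
        values = [case[feature] for case in cases]
        if feature == age_key:
            profile[feature] = mean(values)
        else:
            mode_value = Counter(values).most_common(1)[0][0]
            if mode_value != 0:
                profile[feature] = mode_value
    return profile

# Toy cluster of three cases (made-up values).
toy_cluster = [
    {"mass_margin": 1, "calc_distribution": 0, "age": 50},
    {"mass_margin": 1, "calc_distribution": 0, "age": 60},
    {"mass_margin": 3, "calc_distribution": 5, "age": 55},
]
toy_profile = mode_profile(toy_cluster)
```

Because each feature is summarized independently, a small cluster with no dominant value can still produce a mode, which is the caveat raised for neuron #10.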

A  BP-ANN  was  trained  to  predict  the  biopsy  outcome  from  the  BI-RADS™  features 
and  patient  age.  Fig.  6  shows  the  ROC  curve  for  the  BP-ANN.  The  SOM  can  also  be  used  to 
generate a malignancy prediction [4]. For each case, the prediction was the fraction of the
cases  that  were  malignant  in  the  cluster  that  the  case  was  mapped  to  by  the  SOM.  For 
example,  if  a  case  belonged  to  cluster  #4  in  which  83%  of  the  cases  were  malignant,  then 
the  classifier  output  for  that  case  would  be  0.83.  Notice  that  using  this  approach  limits  the 
number  of  operating  points  on  the  non-parametric  ROC  curve  to  the  number  of  clusters 
with  unique  malignancy  fractions  minus  one  (Fig.  6).  The  performance  at  the  highest 
sensitivities  was  comparable.  In  particular,  at  98%  sensitivity  the  SOM  operates  with 
0.26  ±  0.03  specificity  and  the  BP-ANN  operates  with  0.25  ±  0.03  specificity  (P  =  0.93). 
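
The SOM-based prediction scheme can be illustrated as follows. This is a minimal sketch under assumed inputs (a cluster assignment and a 0/1 biopsy outcome per case); the function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict

def cluster_malignancy_fractions(cluster_ids, labels):
    """Return {cluster id: fraction of its cases that are malignant (label 1)}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for cid, y in zip(cluster_ids, labels):
        totals[cid] += 1
        positives[cid] += y
    return {cid: positives[cid] / totals[cid] for cid in totals}

def som_predictions(cluster_ids, fractions):
    """Each case's output is the malignancy fraction of the cluster it maps to."""
    return [fractions[cid] for cid in cluster_ids]

# Toy data: cluster 4 has 5 of 6 malignant, cluster 6 has 1 of 4 malignant.
toy_clusters = [4, 4, 4, 4, 4, 4, 6, 6, 6, 6]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
toy_fractions = cluster_malignancy_fractions(toy_clusters, toy_labels)
toy_scores = som_predictions(toy_clusters, toy_fractions)
n_distinct_outputs = len(set(toy_scores))
```

Since every case in a cluster receives the same output, the number of distinct output values, and hence the number of useful thresholds, is bounded by the number of clusters with unique malignancy fractions.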

Fig.  7  lists  how  the  BP-ANN  trained  on  all  the  cases  performs  in  terms  of  the  BP-ANN’s 
recommendations  for  follow  up  instead  of  biopsy  on  the  subsets  identified  by  the  SOM.  A 


Fig.  6.  ROC  curves  for  the  BP-ANN  and  the  SOM.  For  each  case,  the  prediction  from  the  SOM  was  the  fraction 
of  the  cases  in  the  cluster  it  belonged  to  that  were  malignant. 


threshold  was  applied  to  the  BP-ANN  outputs  such  that  the  overall  sensitivity  was 
approximately 98% (965/982) with resulting specificity of approximately 24% (303/1276).
In other words, 320 cases (303 actual negatives and 17 actual positives) fell below
the  threshold.  These  320  cases  that  the  BP-ANN  would  have  recommended  for  follow  up 


Fig.  7.  Comparison  of  the  performance  of  the  BP-ANN  trained  on  all  the  cases  on  the  clusters  identified  by  the 
SOM.  For  each  cluster  the  number  of  true  negatives  (normal  type)  and  the  number  of  false  negatives  (italics)  is 
shown. 
are shown in Fig. 7 according to the SOM cluster to which they belonged. Notice that there is
considerable variability in the performance on the clusters. In particular, the majority of the
cancers that the BP-ANN would have referred to follow up (11/17 = 65%) and the
majority  of  the  benign  lesions  that  the  BP-ANN  would  have  spared  biopsy  (242/303  = 
80%)  were  in  the  cluster  defined  by  neuron  #6. 
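
The per-cluster tally behind this comparison can be sketched as below. The sketch assumes each case has a classifier output, a 0/1 biopsy outcome, and a SOM cluster id; the function name and the toy values are hypothetical, and only the threshold of 0.1842 comes from the text.

```python
from collections import defaultdict

def follow_up_by_cluster(scores, labels, cluster_ids, threshold):
    """Tally cases scoring below the threshold (recommended for follow up).

    Returns {cluster id: (true negatives, false negatives)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for score, y, cid in zip(scores, labels, cluster_ids):
        if score < threshold:
            counts[cid][y] += 1  # index 0: benign (TN), index 1: malignant (FN)
    return {cid: tuple(pair) for cid, pair in counts.items()}

# Toy example with six cases (made-up outputs).
toy_scores = [0.05, 0.10, 0.30, 0.15, 0.90, 0.12]
toy_labels = [0, 0, 0, 1, 1, 0]
toy_clusters = [6, 6, 2, 6, 2, 3]
toy_tally = follow_up_by_cluster(toy_scores, toy_labels, toy_clusters, threshold=0.1842)
```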

These  interesting  results  with  the  cluster  defined  by  neuron  #6  suggested  that  a  simple 
rule-based  approach  could  be  valuable.  We  developed  a  classification  rule  based  on  the 
cluster  profiles  (Figs.  4  and  5)  of  neuron  #6  and  a  classification  and  regression  tree  (CART) 
[3]  model  for  mass  cases  using  the  implementation  in  S-PLUS®  (Insightful  Corp.,  Seattle, 
WA).  The  classification  rule  was:  if  the  mass  margin  was  well-circumscribed  or  obscured 
and  the  age  was  less  than  59  years  and  there  were  no  calcifications,  associated  findings,  or 
special  findings,  then  do  not  biopsy,  otherwise  do  biopsy.  On  the  2258  training  cases,  this 
rule  gave  961/982  =  98%  sensitivity  and  336/1276  =  26%  specificity.  In  other  words, 
this  rule  performed  comparably  to  the  BP-ANN  with  a  threshold  of  0.1842 
(965/982  =  98%  sensitivity,  303/1276  =  24%  specificity). 
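
The classification rule and the reported operating points can be written out as a short sketch. The argument names and the string labels for the mass margin are assumptions made for illustration; the rule logic and the counts are those stated above.

```python
def recommend_biopsy(mass_margin, age, has_calcifications,
                     has_associated_findings, has_special_findings):
    """Return False (follow up instead of biopsy) only when the rule's conditions all hold."""
    likely_benign = (
        mass_margin in ("well-circumscribed", "obscured")
        and age < 59
        and not (has_calcifications or has_associated_findings or has_special_findings)
    )
    return not likely_benign

def sensitivity_specificity(true_pos, n_malignant, true_neg, n_benign):
    """Sensitivity = TP / malignant cases; specificity = TN / benign cases."""
    return true_pos / n_malignant, true_neg / n_benign

# Training-set counts reported in the text.
rule_sens, rule_spec = sensitivity_specificity(961, 982, 336, 1276)
ann_sens, ann_spec = sensitivity_specificity(965, 982, 303, 1276)
```

Rounded to whole percentages, these counts reproduce the 98% sensitivity for both methods and the 26% (rule) and 24% (BP-ANN) specificities.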

The performance of the BP-ANN and the classification rule developed from data mining
was evaluated on the 2177 cases withheld for model validation. On the validation set, the
classification  rule  gave  886/904  =  98%  sensitivity  and  339/1273  =  27%  specificity  and 
the  BP-ANN  with  a  threshold  of  0.1842  gave  884/904  =  98%  sensitivity  and  296/1273  = 
23%  specificity.  Thus,  both  the  BP-ANN  and  the  rule-based  approach  generalized  and  they 
performed  comparably  at  this  high  sensitivity  point. 


4.  Discussion 

Considerable  variability  was  seen  in  the  fraction  of  the  cases  that  were  malignant  from 
cluster  to  cluster.  Several  clusters  had  malignancy  fractions  that  were  notably  different 
from  the  fraction  of  the  entire  data  set  (43%).  One  of  the  major  goals  of  computer-aided 
diagnosis  of  breast  cancer  is  to  identify  very  likely  benign  cases  as  candidates  for  follow 
up  in  lieu  of  biopsy,  in  order  to  reduce  the  number  of  benign  biopsies.  Therefore,  the 
clusters  with  very  low  malignancy  fractions  (e.g.  neuron  #6  with  6%  malignant)  are 
dominated  by  such  very  likely  benign  lesions  and  may  be  of  particular  interest  for  further 
studies.  It  is  possible  to  use  the  clusters  and  their  malignancy  fractions  directly  as  a  tool 
for predicting biopsy outcome [4]. For each case, the prediction was the fraction of the
cases that were malignant in the cluster that the case was mapped to by the SOM (Fig. 6).
For very high sensitivities, this prediction scheme (98% sensitivity, 0.26 ± 0.03 specificity)
was competitive with the back-propagation artificial neural network (98%
sensitivity, 0.25 ± 0.03 specificity, P = 0.93); however, this SOM-based method was
not  superior  to  the  BP-ANN.  The  SOM  prediction  method  in  conjunction  with  the  CSNN 
profiling  method  has  the  potential  advantage  that  physicians  may  understand  the  intuition 
behind  it  better  than  they  do  the  BP-ANN,  which  is  often  seen  as  a  “black  box”.  The 
SOM prediction method, similar to a case-based reasoning system, predicts the probability
of malignancy of a new case by reporting the fraction of similar cases that were
found  to  be  malignant  [7].  The  SOM  prediction  method  could  also  potentially  be  used  in 
an  ensemble  of  classifiers.  If  the  outputs  of  two  classifiers  are  not  strongly  correlated,  it  is 
possible that they could be combined to produce a classifier that is better than either of its
component  classifiers. 

The effects of changing the SOM architecture were investigated (Fig. 3). As
indicated  by  the  presence  of  large  circles  in  the  bubble  plots,  the  SOMs  with  similar 
architectures  showed  substantial  agreement  in  clustering  the  data.  Moreover,  the  presence 
of linear trends in the comparisons with the 5 x 5, 2 x 8, and 2 x 3 x 3 SOMs suggests that
similar  SOM  architectures  result  in  similar  geometric  relationships  between  clusters.  These 
data  argue  that  the  clustering  is  relatively  insensitive  to  the  SOM  architecture  for  this 
problem.  It  should  be  noted  that  this  study  did  not  focus  on  the  organization  of  the  clusters 
into  a  topological  map.  Consequently,  many  of  the  analyses  in  this  study  could  have  been 
performed  using  other  clustering  algorithms. 

Fig.  4  lists  the  CSNN  profiles  for  the  clusters  identified  with  the  SOM.  The  successful 
separation  of  a  priori  known,  coarse  lesion  types  (masses,  clustered  microcalcifications, 
focal  asymmetric  densities,  and  architectural  distortions)  provided  some  quality  assurance 
of  the  clustering.  Clusters  were  further  identified  within  the  general  group  of  mass  lesions, 
reflecting  different  combinations  of  the  mass  margin,  mass  shape,  and  patient  age 
variables.  The  cluster  profiles  that  included  calcification  features  showed  stratification 
of  the  general  group  of  calcification  lesions  only  by  patient  age  and  not  any  of  the 
calcification  findings.  Notice  that  while  some  features  may  not  be  considered  useful  by  the 
CSNN  for  profiling  individual  clusters,  it  is  possible  that  they  could  be  useful  to  other 
summarizing  techniques  or  to  methods  designed  to  describe  the  differences  between 
clusters. 

An  alternative  approach  to  characterizing  the  clusters  is  to  calculate  summary  statistics 
for  each  of  the  features.  Fig.  5  shows  the  mode  for  each  of  the  BI-RADS™  features  and  the 
mean  of  the  patient  age  for  each  cluster.  In  general,  there  is  good  agreement  in  the  cluster 
descriptions  obtained  from  these  summary  plots  and  the  CSNN  profiles.  However,  they  are 
not  identical.  The  most  notable  differences  are  for  neurons  #10  and  16,  which  show  the 
advantages  and  disadvantages,  respectively,  of  the  fact  that  the  CSNN  method  inherently 
includes  feature  selection. 

It  may  be  easier  to  interpret  a  CSNN  profile,  with  typically  only  a  few  dominant  features 
per  cluster,  than  to  interpret  as  many  summary  values  as  there  are  input  findings.  Note  as 
well that the CSNN takes into account the interdependencies between the features, while
the  summary  statistics  were  based  on  each  feature  independently.  CSNN  profiles  or 
summary  statistics  can  be  used  to  quickly  sort  through  the  results  of  a  clustering  technique, 
but  additional  characterization  may  be  appropriate  for  clusters  of  particular  interest. 

Classification based on the SOM was competitive with that achieved by the BP-ANN at
high  sensitivity  levels  (Fig.  6).  Notable  variation  in  the  performance  over  the  clusters 
identified  by  the  SOM  was  observed  (Fig.  7).  This  is  consistent  with  our  previous  work 
demonstrating  performance  differences  with  an  a  priori  partitioning  of  the  data  into  two 
broad  subgroups  of  lesions,  masses  and  microcalcifications  [17]  and  suggests  that  further 
work  should  be  done  to  investigate  building  cluster-specific  models.  The  variation  in  the 
BP-ANN  performance  across  the  clusters  could  also  influence  the  ultimate  clinical 
implementation  of  the  decision  aid  since  it  may  not  be  useful  to  apply  the  BP-ANN  to 
cases  similar  to  those  groups  of  cases  for  which  it  always  recommended  biopsy  in  the 
training  set.  Interestingly,  the  SOM  identified  a  cluster  of  mass  cases  (#6)  which  accounted 
for the majority of the cases that the BP-ANN would have recommended for follow up rather than
biopsy.  Recall  that  the  identification  of  likely  benign  cases  that  could  be  spared  biopsy  is  the 
goal  of  such  computer-aided  diagnosis  schemes.  This  suggests  that  the  SOM  clustering  and 
CSNN  profiling  technique  could  be  used  to  provide  the  physician  with  an  alternative 
description  of  what  the  BP-ANN  does  for  certain  types  of  cases.  The  identification  of  a  single 
cluster  that  accounted  for  the  majority  of  the  cases  that  the  BP-ANN  would  have 
recommended  for  follow  up  also  suggests  the  investigation  of  rule-based  methods  to  identify 
relatively  simple  diagnostic  criteria  which  might  be  applied  to  these  cases  to  aid  the 
radiologists  in  their  decision  making  process.  Based  on  the  profiles  of  the  clusters  identified 
by the SOM, we developed a simple classification rule that performed comparably to the BP-ANN
(approximately 25% specificity with 98% sensitivity). Moreover, we demonstrated that
the  classification  rule  generalized  to  2177  cases  withheld  for  model  validation. 

ABSTRACT

We  developed  an  ensemble  classifier  for  the  task  of  computer-aided  diagnosis  of  breast  microcalcification  clusters, 
which  are  very  challenging  to  characterize  for  radiologists  and  computer  models  alike.  The  purpose  of  this  study  is  to 
help  radiologists  identify  whether  suspicious  calcification  clusters  are  benign  vs.  malignant,  such  that  they  may 
potentially  recommend  fewer  unnecessary  biopsies  for  actually  benign  lesions.  The  data  consists  of  mammographic 
features  extracted  by  automated  image  processing  algorithms  as  well  as  manually  interpreted  by  radiologists  according 
to a standardized lexicon. We used 292 cases from a publicly available mammography database. From each case, we
extracted  22  image  processing  features  pertaining  to  lesion  morphology,  5  radiologist  features  also  pertaining  to 
morphology,  and  the  patient  age.  Linear  discriminant  analysis  (LDA)  models  were  designed  using  each  of  the  three  data 
types.  Each  local  model  performed  poorly;  the  best  was  one  based  upon  image  processing  features  which  yielded  ROC 
area  index  Az  of  0.59  ±  0.03  and  partial  Az  above  90%  sensitivity  of  0.08  ±  0.03.  We  then  developed  ensemble  models 
using  different  combinations  of  those  data  types,  and  these  models  all  improved  performance  compared  to  the  local 
models.  The  final  ensemble  model  was  based  upon  5  features  selected  by  stepwise  LDA  from  all  28  available  features. 
This  ensemble  performed  with  Az  of  0.69  ±  0.03  and  partial  Az  of  0.21  ±  0.04,  which  was  statistically  significantly 
better  than  the  model  based  on  the  image  processing  features  alone  (p<0.001  and  p=0.01  for  full  and  partial  Az 
respectively).  This  demonstrated  the  value  of  the  radiologist-extracted  features  as  a  source  of  information  for  this  task.  It 
also  suggested  there  is  potential  for  improved  performance  using  this  ensemble  classifier  approach  to  combine  different 
sources  of  currently  available  data. 

Keywords:  computer-aided  diagnosis,  breast  cancer,  BI-RADS,  image  processing,  ensemble  classifier 

1.  INTRODUCTION 


1.1  Clinical  significance 

Mammography  is  the  modality  of  choice  for  early  detection  of  breast  cancer.  Although  mammography  is  very 
sensitive  at  detecting  breast  cancer,  its  low  positive  predictive  value  (PPV)  results  in  biopsy  of  a  large  number  of  benign 
lesions.  Of  women  with  radiographically-suspicious,  nonpalpable  lesions  who  are  sent  to  biopsy,  only  15  to  34% 
actually  have  a  malignancy  by  histologic  diagnosis  [1,2].  The  excessive  biopsy  of  benign  lesions  raises  the  cost  of 
mammographic  screening  [3]  and  results  in  emotional  and  physical  burden  to  the  patients,  as  well  as  financial  burden  to 
society.  It  is  imperative  to  improve  the  specificity  of  breast  biopsy  by  identifying  probably  benign  lesions  for  short-term 
follow-up  instead  of  biopsy,  while  maintaining  the  very  high  sensitivity  of  cancer  detection  [4,5]. 

The  presence  of  clustered  microcalcifications  is  one  of  the  most  important  and  sometimes  the  only  sign  of 
cancer  on  a  mammogram  [6].  In  a  recent  study  from  this  institution,  radiologists  demonstrated  an  interesting  dichotomy 
in  performance  when  asked  to  assess  the  likelihood  of  malignancy  among  1468  nearly  consecutive  mammography  cases 
[7]. They performed significantly better as measured by ROC area index (Az) for the mass cases (0.94 ± 0.01) compared
to the calcifications (0.74 ± 0.02). Similar trends were observed for a variety of statistical and machine learning
modeling techniques as well. Another study from a different institution reported similarly low radiologists’ performance
(Az of 0.61) for 104 nearly consecutive microcalcification cases [8].

These  studies  indicate  there  is  tremendous  room  for  improvement  for  these  calcification  cases.  If  performance 
can  be  improved  for  these  challenging  cases,  overall  specificity  will  in  turn  be  dramatically  increased.  It  should  be  noted 
that  in  clinical  practice,  the  radiologist’s  task  is  to  recommend  whether  or  not  to  biopsy,  rather  than  predicting  an  explicit 
likelihood  of  malignancy  among  lesions  already  recommended  for  biopsy.  Nevertheless,  the  fact  remains  that  two-thirds 
or  more  of  currently  biopsied  cases  are  actually  benign.  In  order  to  improve  specificity  of  breast  biopsy,  the  additional 
challenge  becomes  to  identify  a  priori  more  very  likely  benign  cases  among  those  currently  referred  to  biopsy. 

1.2  Computer  aided  diagnosis 

It  is  important  to  distinguish  computer-aided  detection  vs.  computer-aided  diagnosis  or  classification.  For 
computer-aided  detection  (CAD),  a  suspicious  lesion  is  detected  and  localized  by  some  automated  computer  vision 
technique  such  as  those  in  the  academic  literature  [9-11]  or  one  of  the  currently  available  commercial  systems.  The  main 
goal  of  computer-aided  detection  is  to  improve  sensitivity  by  helping  radiologists  catch  disease  which  might  otherwise 
have  been  missed.  Once  the  lesion  has  been  detected  by  radiologists  and/or  some  computer-aided  detection  device,  a 
computer-aided  diagnosis  (CADx)  system  then  helps  the  radiologist  to  classify  that  lesion  or  to  make  a  patient 
management  decision.  The  main  goal  of  computer-aided  diagnosis  is  typically  to  help  improve  specificity,  such  as  by 
sparing  unnecessary  benign  biopsies. 

We  propose  a  classifier  to  aid  in  the  decision  task  of  whether  to  biopsy  a  suspicious  lesion  or  to  refer  the  case  to 
short-term  follow-up  surveillance.  Correct  diagnoses  have  the  following  implications:  very  likely  benign  cases  may 
undergo  follow-up  instead  of  biopsy,  while  the  remaining  indeterminate  cases  should  undergo  biopsy  for  confirmation 
by  histopathologic  diagnosis.  Incorrect  diagnoses  have  the  following  implications:  false  positive  errors  (benign  lesions 
misclassified  as  malignant)  may  result  in  an  unnecessary  biopsy,  while  false  negative  errors  (malignancies  misclassified 
as  benign)  may  result  in  delayed  diagnosis  of  an  actual  cancer.  Since  the  implications  for  false  negative  errors  far 
outweigh  those  of  false  positives,  CADx  systems  are  typically  evaluated  at  operating  points  corresponding  to  very  high 
sensitivity. 

1.3  Local  vs.  ensemble  models 

There  have  been  three  major  approaches  to  CADx  of  the  breast  biopsy  decision  task,  depending  on  the  source 
of  the  input  data.  The  first  is  to  employ  image  processing  techniques  which  extract  features  from  digitized  or  digital 
mammograms  [8,12-19].  These  fully-  or  semi-automated  CADx  systems  are  not  constrained  by  the  limits  of  human 
vision,  should  be  more  consistent,  and  have  the  potential  to  improve  the  performance  of  less  experienced  radiologists. 

The  second  approach  is  to  rely  upon  radiologists  to  interpret  the  images  and  manually  record  findings  deemed 
clinically relevant [20-26]. This approach draws upon the a priori knowledge of these radiologists, who can characterize
a  tremendous  amount  of  image  information  into  a  list  of  succinct,  useful  findings,  such  as  the  standardized  lexicon 
known  as  the  Breast  Imaging  Reporting  and  Data  System  (BI-RADS;  American  College  of  Radiology,  Reston,  VA) 
[27]. Moreover, such input data is often already available and intuitively meaningful to physicians, which may facilitate
eventual  clinical  acceptance  of  systems  based  on  such  data.  The  disadvantages  are  the  limits  of  human  vision  and 
knowledge,  as  well  as  potential  problems  arising  from  intra-  and  inter-observer  variability. 

A  third  approach  uses  additional  information  including  patient  history,  clinical,  or  demographic  data.  Such  data 
tend  to  correlate  less  well  with  disease,  and  can  often  be  very  subjective  in  quality  and  laborious  to  collect.  An  exception 
may  be  the  patient  age,  which  is  readily  available  and  was  identified  as  a  surprisingly  useful  adjunct  in  predicting 
malignancy in our previous work [28].

We  will  investigate  combining  these  three  sources  of  data  (image  processing,  BI-RADS,  and  history)  into  one 
ensemble system. An ensemble system uses multiple classifiers to solve a classification problem by training multiple
models for the same cases and then combining the models’ predictions [29]. Simple ensembles of classifiers using voting or
averaging  to  combine  their  predictions  have  shown  promise  in  this  field  [30-33].  The  hypothesis  here  is  that  an  ensemble 
classifier  comprised  of  information  from  all  three  sources  of  data  can  significantly  outperform  models  based  upon  local 
subsets  of  data. 
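
As a point of reference for the simple combination schemes mentioned above, averaging and majority voting can be sketched as follows. This is a generic illustration with made-up outputs and hypothetical function names; it is not the LDA-based ensemble evaluated in this study.

```python
def average_ensemble(model_outputs):
    """Average the per-case scores of several models (one list of scores per model)."""
    n_models = len(model_outputs)
    return [sum(case_scores) / n_models for case_scores in zip(*model_outputs)]

def majority_vote(model_decisions):
    """Return 1 for a case when more than half of the models vote 1, else 0."""
    n_models = len(model_decisions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*model_decisions)]

# Two models scoring three cases, and three models voting on three cases.
toy_average = average_ensemble([[0.2, 0.8, 0.5], [0.4, 0.6, 0.1]])
toy_vote = majority_vote([[0, 1, 1], [1, 1, 0], [0, 1, 0]])
```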


2.  MATERIALS  AND  METHODS 


2.1  Database 

Until  recently  all  major  research  laboratories  reported  results  based  upon  private  databases.  The  performance  of 
an  algorithm  is  affected  by  the  characteristics  of  a  database  including  digitizer  choice,  pixel  size,  subtlety  of  cases, 
choice  of  training/testing  subsets,  and  the  number  of  cases  in  each  subset,  thus  making  it  almost  impossible  to  compare 
results  reported  from  different  research  groups  [34].  The  establishment  of  the  Digital  Database  for  Screening 
Mammography  (DDSM)  [35]  allows  the  possibility  of  common  training  and  testing  data  sets  for  the  first  time.  The 
DDSM  is  the  largest  publicly  available  database  of  mammographic  data.  It  contains  approximately  2000  screening 
mammography  cases  obtained  between  1988  and  1999  at  several  institutions  including  Massachusetts  General  Hospital, 
Wake  Forest  University  School  of  Medicine,  Sacred  Heart  Hospital,  and  Washington  University  in  St.  Louis  School  of 
Medicine. 

For this pilot study, we specified patient selection criteria to provide a reasonable number of cases while
keeping  methodological  and  statistical  issues  as  simple  as  possible.  From  cases  with  definitive  pathology  outcome  (i.e. 
not  “unproven”  or  “benign,  no  call  back”),  we  randomly  selected  292  cases  which  were  digitized  by  the  Howtek  digitizer 
(which  had  the  most  cases  compared  to  the  other  2  types  of  digitizers).  Each  case  had  only  one  cluster  recorded  in  the 
truth  file,  and  that  cluster  was  successfully  segmented  by  our  existing  automated  detection  algorithm  which  has  been 
described  in  detail  previously  [36,37].  All  image  processing  was  performed  only  on  the  medio-lateral  oblique  (MLO) 
view  of  the  breast  containing  the  lesion,  thus  obviating  any  problems  due  to  per-case  vs.  per-patient  sampling  and 
performance  analysis. 

2.2  Computer-extracted  features 

For  each  case,  we  used  the  aforementioned  detection  technique  [36,37]  as  the  front  end  to  localize  clusters  and 
segment  individual  calcifications  within  those  clusters.  This  fully  automated  detection  scheme  consisted  of  three  main 
processing  steps: 

(1)  Pre-processing.  The  breast  region  was  segmented  and  its  high  frequency  content  was  enhanced  by  unsharp  masking. 

(2)  Segmentation  of  individual  calcifications.  Individual  microcalcifications  were  segmented  using  local  histogram 
analysis  on  small,  overlapping  regions  of  interest  (ROIs).  Each  histogram  was  modeled  as  a  possible  bimodal 
distribution  of  bright  calcifications  on  a  darker  background.  Histogram  features  were  extracted  and  then  merged  using  a 
back-propagation  artificial  neural  network  (BP-ANN)  classifier  [38-40]  to  determine  whether  each  ROI  contained  a 
calcification. 

(3)  Cluster  classification.  The  calcifications  were  clustered  using  a  nearest  neighbor  algorithm.  Features  were  extracted 
describing  each  cluster  and  then  merged  using  another  BP-ANN  classifier  to  reduce  the  number  of  false  positive  clusters. 

For  each  cluster,  22  image  processing  features  were  calculated.  These  consisted  of  the  number  of  calcifications, 
logarithm  of  that  number,  total  area  of  all  calcifications,  logarithm  of  that  area,  and  the  mean  and  standard  deviation  of 
each  of  the  following  nine  morphological  features:  calcification  distance,  number  of  overlaps  (resulting  from  the 
overlapping  ROIs  in  histogram  analysis),  calcification  area,  compactness,  central  moment,  Fourier  descriptor, 
eccentricity,  spread,  and  orientation. 
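
The count of 22 features follows from 4 cluster-level quantities plus the mean and standard deviation of 9 morphological features (4 + 2 x 9 = 22). The sketch below simply enumerates them; the exact name strings are illustrative assumptions.

```python
MORPHOLOGICAL_FEATURES = [
    "calcification distance", "number of overlaps", "calcification area",
    "compactness", "central moment", "Fourier descriptor",
    "eccentricity", "spread", "orientation",
]

def cluster_feature_names():
    """List the 22 per-cluster image processing features described in the text."""
    names = [
        "number of calcifications", "log number of calcifications",
        "total calcification area", "log total calcification area",
    ]
    for feature in MORPHOLOGICAL_FEATURES:
        names.append(f"mean {feature}")
        names.append(f"std {feature}")
    return names

feature_names = cluster_feature_names()
```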


A  region  from  a  sample  case  containing  a  malignant  calcification  cluster  and  the  corresponding  detection  output 
are  shown  in  Figure  1. 


Figure 1. Sample detection output for malignant cluster (left: case 1108, left CC, cluster outlined by experienced radiologist; right: true
positive detected cluster bounded by rectangle).


Table 1. BI-RADS mammographic features and numeric encoding

Calc. Distribution                  Mass Margin
no calcifications              0    no mass                    0
diffuse                        1    well circumscribed         1
regional                       2    microlobulated             2
segmental                      3    obscured                   3
linear                         4    ill-defined                4
clustered                      5    spiculated                 5

Calc. Morphology                    Mass Shape
no calcifications              0    no mass                    0
milk of calcium-like           1    round                      1
eggshell or rim                2    oval                       2
skin                           3    lobulated                  3
vascular                       4    irregular                  4
spherical or lucent centered   5
suture                         6    Associated Findings
coarse                         7    none                       0
large rod-like                 8    skin lesion                1
round                          9    hematoma                   2
dystrophic                    10    post surgical scar         3
punctate                      11    trabecular thickening      4
indistinct                    12    skin thickening            5
pleomorphic                   13    skin retraction            6
fine branching                14    nipple retraction          7
                                    axillary adenopathy        8
                                    architectural distortion   9


2.3  Human-extracted  features 

For  each  case,  we  also  extracted  5  BI-RADS  features  and  the  patient  age  from  the  database.  The  BI-RADS 
features  were:  calcification  morphology,  calcification  distribution,  mass  shape,  mass  margin,  and  associated  findings. 
Note  that  the  two  mass  findings  occur  because  these  calcification  cases  were  defined  as  those  with  the  presence  of 
calcification  findings.  In  some  cases,  an  associated  mass  was  also  present.  The  text  labels  for  each  BI-RADS  feature 
were  translated  into  numeric  values  using  a  rank  ordering  system  shown  below  in  Table  1  which  we  have  used 
previously in developing models with this type of data [28,41]. In cases where a feature was described by multiple
values,  such  as  if  there  were  two  values  for  the  calcification  distribution,  the  greatest  value  corresponding  to  the  highest 
likelihood  of  malignancy  was  used. 
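
The encoding step can be sketched as below, using the calcification distribution codes from Table 1. The dictionary and function names are hypothetical; the rule of keeping the greatest code when several values are recorded is the one stated above.

```python
# Codes for calcification distribution, as listed in Table 1.
CALC_DISTRIBUTION = {
    "no calcifications": 0, "diffuse": 1, "regional": 2,
    "segmental": 3, "linear": 4, "clustered": 5,
}

def encode_finding(labels, codes):
    """Map one or more text labels for a feature to a single code (the greatest)."""
    return max(codes[label] for label in labels)

single_value = encode_finding(["segmental"], CALC_DISTRIBUTION)
multiple_values = encode_finding(["regional", "linear"], CALC_DISTRIBUTION)
```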

2.4  Statistical  Sampling  and  Measurements 

Due  to  the  relatively  low  number  of  cases  available,  all  modeling  was  performed  using  linear  discriminant 
analysis  (LDA)  using  SAS  software  (SAS  Inc.,  Cary  NC)  with  round  robin  sampling.  Az  and  partial  Az  above  sensitivity 
of  90%  were  calculated  using  LABROC4  and  compared  using  CLABROC  (both  modified  by  Charles  Metz,  University 
of  Chicago,  to  provide  partial  Az  calculations).  The  partial  Az  was  used  to  characterize  the  more  clinically  relevant  high 
sensitivity  sub-region  of  the  ROC  curve,  which  emphasizes  the  far  greater  cost  of  a  missed  cancer  compared  to  an 
unnecessary  benign  biopsy  [42]. 
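
To make the partial area index concrete, the sketch below computes an empirical (trapezoidal) version from raw scores. It is a stand-in for illustration only: the study used the fitted curves from LABROC4 and CLABROC, and the normalization by (1 - 0.90), under which a perfect classifier scores 1.0 and a chance-level classifier scores 0.05, is an assumption made here. All function names are hypothetical.

```python
def roc_points(scores, labels):
    """Empirical ROC operating points (FPF, TPF), starting from (0, 0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the decision threshold through each distinct score, highest first.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def normalized_partial_area(points, tpf0=0.90):
    """Area under the ROC curve and above TPF = tpf0, divided by (1 - tpf0)."""
    area = 0.0
    for (f_a, t_a), (f_b, t_b) in zip(points, points[1:]):
        if t_b <= tpf0 or t_b == t_a:
            continue  # segment lies below the sensitivity floor, or is horizontal
        if t_a < tpf0:
            # Clip the segment at TPF = tpf0 by linear interpolation.
            f_a = f_a + (f_b - f_a) * (tpf0 - t_a) / (t_b - t_a)
            t_a = tpf0
        # Integrate (1 - FPF) over TPF along this segment.
        area += (t_b - t_a) * (1.0 - (f_a + f_b) / 2.0)
    return area / (1.0 - tpf0)

perfect = normalized_partial_area(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
chance = normalized_partial_area([(0.0, 0.0), (1.0, 1.0)])
```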


3.  RESULTS 

The  results  are  summarized  in  Table  2  and  Figure  2  below.  Each  row  represents  the  performance  of  a  model 
based  upon  a  combination  of  one  or  more  of  the  three  sources  of  data,  which  have  been  color  coded:  A)  blue  for  image 
processing  features,  B)  pink  for  BI-RADS,  and  C)  green  for  the  sole  history  feature  of  age.  The  columns  labeled  as  A,  B, 
and  C  indicate  that  the  model  on  that  row  used  some  or  all  of  the  features  from  that  source  of  data. 

None  of  the  3  sources  of  patient  data  by  itself  provided  much  useful  information,  as  shown  in  rows  A,  B,  and  C 
all  with  Az  <  0.6.  There  were  however  interesting  improvements  when  these  data  were  combined  together.  For  example, 
on  row  D,  the  addition  of  just  age  (which  by  itself  performed  close  to  chance)  significantly  improved  performance  over 
the 5 BI-RADS features alone in row B (p<0.001 for both full and partial Az). On row E, the further addition of the 22
image  processing  features,  i.e.,  using  all  28  available  features,  did  not  improve  performance  compared  to  row  D.  On  row 
F,  when  stepwise  LDA  was  used  to  reduce  those  28  total  features  to  just  5,  however,  that  yielded  the  best  performance  of 
all  at  Az  of  0.69  and  partial  Az  of  0.21.  This  final  5-feature  model  is  significantly  better  than  using  only  the  22  image 
processing  features  (row  F  vs.  row  A)  for  both  full  and  partial  Az  (p<0.001  and  p=0.01  respectively).  The  final  5-feature 
model was not however significantly better than the 6 human-extracted features (row F vs. row D) for either full or partial Az
(p=0.07  and  p=0.15  respectively).  Those  final  5  features  were  (in  order  of  descending  significance):  BI-RADS 
calcification  distribution,  mean  central  moment,  mean  eccentricity,  BI-RADS  mass  margin,  and  BI-RADS  calcification 
morphology. 


Table 2. Performance of LDA models with different feature combinations from 292 DDSM cases.

Feature combination                                    A   B   C   Az            Partial Az
A) 22 image processing only                            x           0.59 ± 0.03   0.08 ± 0.03
B) 5 BI-RADS by radiologist only                           x       0.58 ± 0.03   0.07 ± 0.02
C) Age alone                                                   x   0.52 ± 0.03   0.05 ± 0.02
D) All 6 human extracted (5 BI-RADS + age)                 x   x   0.66 ± 0.03   0.13 ± 0.03
E) All 28 features (22 image + 5 BI-RADS + age)        x   x   x   0.65 ± 0.03   0.11 ± 0.03
F) Stepwise selection of top 5 features from all 28    x   x       0.69 ± 0.03   0.21 ± 0.04


The ROC curves for rows A, D, and F are plotted below in Figure 2. As described above, row A with the image
processing features only performed poorly (shown by red line), and the final ensemble including contributions from
BI-RADS features improved that performance significantly (row F shown by gold line). The human extracted features only
from row D were intermediate between those two curves.


Figure  2.  ROC  Curves  for  Ensemble  vs.  Local  Models 


4.  DISCUSSION 

These results suggest several interesting trends. Local models based upon either the image processing features or
the BI-RADS features alone performed comparably, and comparably poorly. The addition of age to the BI-RADS
features significantly improved performance, which is consistent with our previous experience with these human-only
models. Adding all 22 image processing features did not help (row E vs. row D), but after stepwise feature selection the
combined model improved performance further, albeit not significantly (row F vs. row D).
That may change with either more cases or better image processing features. For example, there are many other
morphological features not used here, as well as several different categories of texture features which have been shown
to be very useful for this particular task [14].

Intriguingly,  the  feature-reduced  model  did  not  include  age  as  one  of  its  remaining  features.  Apparently  the 
significance  of  age  was  much  decreased  in  the  presence  of  these  image  processing  and  BI-RADS  features,  a  fact  that 
warrants  further  investigation.  The  final  ensemble  model  based  upon  2  image  processing  and  3  BI-RADS  features  did 
significantly  outperform  one  based  upon  just  the  22  available  image  processing  features,  supporting  once  again  the  value 
of  these  BI-RADS  findings  in  this  decision  task. 

It  should  be  noted  that  although  the  trends  support  the  value  of  building  ensemble  models  for  this  data,  these 
ROC  performance  values  were  still  quite  poor.  The  best  Az  was  only  0.69  and  the  best  partial  Az  0.21,  corresponding  to 
the  average  specificity  over  the  range  of  sensitivities  from  0.90  to  1.00.  Given  the  equally  poor  performance  of 
radiologists  for  calcification  cases,  however,  there  is  great  potential  for  improvement.  In  the  end,  the  most  important  test 
in  the  future  will  be  to  assess  whether  radiologists  can  use  such  models  to  improve  their  clinical  performance. 
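
To make the partial Az definition above concrete, the following is a minimal sketch that computes the full area and the average specificity over sensitivities from 0.90 to 1.00 from an empirical (trapezoidal) ROC curve. This is an assumption for illustration: the Az values reported here may have come from a fitted ROC curve instead, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Empirical ROC operating points (FPF, TPF), from (0, 0) to (1, 1).

    labels: 1 = malignant, 0 = benign; a higher score means more suspicious.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Cases tied at this score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_above_sensitivity(points, min_sens=0.0):
    """Integrate specificity (1 - FPF) over sensitivity from min_sens to 1."""
    area = 0.0
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        if t1 <= min_sens or t1 == t0:
            continue  # segment lies below the range, or is horizontal
        if t0 < min_sens:
            # Clip the segment at min_sens by linear interpolation.
            f0 = f0 + (f1 - f0) * (min_sens - t0) / (t1 - t0)
            t0 = min_sens
        area += (t1 - t0) * (1.0 - (f0 + f1) / 2.0)
    return area


def full_az(scores, labels):
    """Area under the empirical ROC curve."""
    return area_above_sensitivity(roc_points(scores, labels), 0.0)


def partial_az(scores, labels, min_sens=0.90):
    """Average specificity over sensitivities in [min_sens, 1]."""
    points = roc_points(scores, labels)
    return area_above_sensitivity(points, min_sens) / (1.0 - min_sens)
```

Under this definition a chance-level classifier (the diagonal ROC line) has a full area of 0.5 and a partial area of 0.05, since specificity falls linearly from 0.10 to 0 across that sensitivity range. A perfect classifier scores 1.0 on both measures.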